Cross-Modality Graph-based Language and Sensor Data Co-Learning of Human-Mobility Interaction

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies | IF 3.6 | Q2 (Computer Science, Information Systems) | Pub Date: 2023-09-27 | DOI: 10.1145/3610904
Mahan Tabatabaie, Suining He, Kang G. Shin
{"title":"基于跨模态图语言和传感器数据的人-移动交互协同学习","authors":"Mahan Tabatabaie, Suining He, Kang G. Shin","doi":"10.1145/3610904","DOIUrl":null,"url":null,"abstract":"Learning the human--mobility interaction (HMI) on interactive scenes (e.g., how a vehicle turns at an intersection in response to traffic lights and other oncoming vehicles) can enhance the safety, efficiency, and resilience of smart mobility systems (e.g., autonomous vehicles) and many other ubiquitous computing applications. Towards the ubiquitous and understandable HMI learning, this paper considers both \"spoken language\" (e.g., human textual annotations) and \"unspoken language\" (e.g., visual and sensor-based behavioral mobility information related to the HMI scenes) in terms of information modalities from the real-world HMI scenarios. We aim to extract the important but possibly implicit HMI concepts (as the named entities) from the textual annotations (provided by human annotators) through a novel human language and sensor data co-learning design. To this end, we propose CG-HMI, a novel Cross-modality Graph fusion approach for extracting important Human-Mobility Interaction concepts from co-learning of textual annotations as well as the visual and behavioral sensor data. In order to fuse both unspoken and spoken \"languages\", we have designed a unified representation called the human--mobility interaction graph (HMIG) for each modality related to the HMI scenes, i.e., textual annotations, visual video frames, and behavioral sensor time-series (e.g., from the on-board or smartphone inertial measurement units). The nodes of the HMIG in these modalities correspond to the textual words (tokenized for ease of processing) related to HMI concepts, the detected traffic participant/environment categories, and the vehicle maneuver behavior types determined from the behavioral sensor time-series. To extract the inter- and intra-modality semantic correspondences and interactions in the HMIG, we have designed a novel graph interaction fusion approach with differentiable pooling-based graph attention. The resulting graph embeddings are then processed to identify and retrieve the HMI concepts within the annotations, which can benefit the downstream human-computer interaction and ubiquitous computing applications. We have developed and implemented CG-HMI into a system prototype, and performed extensive studies upon three real-world HMI datasets (two on car driving and the third one on e-scooter riding). We have corroborated the excellent performance (on average 13.11% higher accuracy than the other baselines in terms of precision, recall, and F1 measure) and effectiveness of CG-HMI in recognizing and extracting the important HMI concepts through cross-modality learning. Our CG-HMI studies also provide real-world implications (e.g., road safety and driving behaviors) about the interactions between the drivers and other traffic participants.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"54 1","pages":"0"},"PeriodicalIF":3.6000,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-Modality Graph-based Language and Sensor Data Co-Learning of Human-Mobility Interaction\",\"authors\":\"Mahan Tabatabaie, Suining He, Kang G. 
Shin\",\"doi\":\"10.1145/3610904\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Learning the human--mobility interaction (HMI) on interactive scenes (e.g., how a vehicle turns at an intersection in response to traffic lights and other oncoming vehicles) can enhance the safety, efficiency, and resilience of smart mobility systems (e.g., autonomous vehicles) and many other ubiquitous computing applications. Towards the ubiquitous and understandable HMI learning, this paper considers both \\\"spoken language\\\" (e.g., human textual annotations) and \\\"unspoken language\\\" (e.g., visual and sensor-based behavioral mobility information related to the HMI scenes) in terms of information modalities from the real-world HMI scenarios. We aim to extract the important but possibly implicit HMI concepts (as the named entities) from the textual annotations (provided by human annotators) through a novel human language and sensor data co-learning design. To this end, we propose CG-HMI, a novel Cross-modality Graph fusion approach for extracting important Human-Mobility Interaction concepts from co-learning of textual annotations as well as the visual and behavioral sensor data. In order to fuse both unspoken and spoken \\\"languages\\\", we have designed a unified representation called the human--mobility interaction graph (HMIG) for each modality related to the HMI scenes, i.e., textual annotations, visual video frames, and behavioral sensor time-series (e.g., from the on-board or smartphone inertial measurement units). The nodes of the HMIG in these modalities correspond to the textual words (tokenized for ease of processing) related to HMI concepts, the detected traffic participant/environment categories, and the vehicle maneuver behavior types determined from the behavioral sensor time-series. To extract the inter- and intra-modality semantic correspondences and interactions in the HMIG, we have designed a novel graph interaction fusion approach with differentiable pooling-based graph attention. The resulting graph embeddings are then processed to identify and retrieve the HMI concepts within the annotations, which can benefit the downstream human-computer interaction and ubiquitous computing applications. We have developed and implemented CG-HMI into a system prototype, and performed extensive studies upon three real-world HMI datasets (two on car driving and the third one on e-scooter riding). We have corroborated the excellent performance (on average 13.11% higher accuracy than the other baselines in terms of precision, recall, and F1 measure) and effectiveness of CG-HMI in recognizing and extracting the important HMI concepts through cross-modality learning. 
Our CG-HMI studies also provide real-world implications (e.g., road safety and driving behaviors) about the interactions between the drivers and other traffic participants.\",\"PeriodicalId\":20553,\"journal\":{\"name\":\"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies\",\"volume\":\"54 1\",\"pages\":\"0\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2023-09-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3610904\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3610904","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Learning the human--mobility interaction (HMI) on interactive scenes (e.g., how a vehicle turns at an intersection in response to traffic lights and other oncoming vehicles) can enhance the safety, efficiency, and resilience of smart mobility systems (e.g., autonomous vehicles) and many other ubiquitous computing applications. Towards the ubiquitous and understandable HMI learning, this paper considers both "spoken language" (e.g., human textual annotations) and "unspoken language" (e.g., visual and sensor-based behavioral mobility information related to the HMI scenes) in terms of information modalities from the real-world HMI scenarios. We aim to extract the important but possibly implicit HMI concepts (as the named entities) from the textual annotations (provided by human annotators) through a novel human language and sensor data co-learning design. To this end, we propose CG-HMI, a novel Cross-modality Graph fusion approach for extracting important Human-Mobility Interaction concepts from co-learning of textual annotations as well as the visual and behavioral sensor data. In order to fuse both unspoken and spoken "languages", we have designed a unified representation called the human--mobility interaction graph (HMIG) for each modality related to the HMI scenes, i.e., textual annotations, visual video frames, and behavioral sensor time-series (e.g., from the on-board or smartphone inertial measurement units). The nodes of the HMIG in these modalities correspond to the textual words (tokenized for ease of processing) related to HMI concepts, the detected traffic participant/environment categories, and the vehicle maneuver behavior types determined from the behavioral sensor time-series. To extract the inter- and intra-modality semantic correspondences and interactions in the HMIG, we have designed a novel graph interaction fusion approach with differentiable pooling-based graph attention. The resulting graph embeddings are then processed to identify and retrieve the HMI concepts within the annotations, which can benefit the downstream human-computer interaction and ubiquitous computing applications. We have developed and implemented CG-HMI into a system prototype, and performed extensive studies upon three real-world HMI datasets (two on car driving and the third one on e-scooter riding). We have corroborated the excellent performance (on average 13.11% higher accuracy than the other baselines in terms of precision, recall, and F1 measure) and effectiveness of CG-HMI in recognizing and extracting the important HMI concepts through cross-modality learning. Our CG-HMI studies also provide real-world implications (e.g., road safety and driving behaviors) about the interactions between the drivers and other traffic participants.
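The abstract describes a graph interaction fusion design built on differentiable pooling-based graph attention, but gives no implementation details. Below is a minimal, self-contained sketch (not the authors' code) of what such a layer could look like in PyTorch: a single-head, GAT-style attention over a dense adjacency matrix, followed by a DiffPool-style soft clustering that coarsens one modality's human-mobility interaction graph (HMIG). All class names, dimensions, and the toy graph are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head, GAT-style attention over a dense adjacency matrix."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) adjacency, assumed to include self-loops.
        h = self.proj(x)                                      # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1),
             h.unsqueeze(0).expand(n, n, -1)], dim=-1)        # (N, N, 2*out_dim)
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)   # (N, N) pairwise scores
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only to graph neighbors
        alpha = torch.softmax(scores, dim=-1)
        return F.elu(alpha @ h)                               # aggregated node embeddings


class DiffPoolFusion(nn.Module):
    """DiffPool-style coarsening: soft-assigns HMIG nodes to a small set of clusters."""

    def __init__(self, in_dim: int, hidden_dim: int, n_clusters: int):
        super().__init__()
        self.embed = GraphAttentionLayer(in_dim, hidden_dim)
        self.assign = GraphAttentionLayer(in_dim, n_clusters)

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        z = self.embed(x, adj)                          # node embeddings   (N, hidden_dim)
        s = torch.softmax(self.assign(x, adj), dim=-1)  # soft assignments  (N, n_clusters)
        x_pooled = s.t() @ z                            # pooled features   (n_clusters, hidden_dim)
        adj_pooled = s.t() @ adj @ s                    # pooled adjacency  (n_clusters, n_clusters)
        return x_pooled, adj_pooled


# Toy usage: a 6-node single-modality HMIG (e.g., tokenized annotation words),
# coarsened to 2 clusters. The features and adjacency are random placeholders.
x = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
adj = ((adj + adj.t() + torch.eye(6)) > 0).float()       # symmetrize and add self-loops
pooled_x, pooled_adj = DiffPoolFusion(16, 32, 2)(x, adj)
print(pooled_x.shape, pooled_adj.shape)                   # torch.Size([2, 32]) torch.Size([2, 2])
```

In the paper's setting, pooled embeddings from the textual, visual, and behavioral-sensor HMIGs would then be fused across modalities and decoded to tag HMI concepts in the annotations; that cross-modality step is omitted from this sketch.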
Source Journal
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Category: Computer Science - Computer Networks and Communications
CiteScore: 9.10
Self-citation rate: 0.00%
Articles published: 154