Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

arXiv - CS - Multimedia. Pub Date: 2024-08-09. DOI: arxiv-2408.05090

Huilin Tian, Jingke Meng, Wei-Shi Zheng, Yuan-Ming Li, Junkai Yan, Yunong Zhang



Abstract

Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction has been completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input but neglect the crucial role of the agent's spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on grounding in outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. We therefore introduce a novel framework, Locating Before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform spatial localization before planning a decision action based on the corresponding guidance. The framework comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, the BAL module learns a position predictor that measures how far the agent is from the next intersection, which reflects its position. After the locating process, the SAP module incorporates this spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms state-of-the-art methods.
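The locate-then-plan ordering described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the abstract describes a learned position predictor and a learned planner, whereas the sketch below substitutes hand-written rules. The names `Observation`, `locate`, `plan`, and `navigate`, the action vocabulary, and the idea of representing the instruction as a list of sub-instructions are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    # Hypothetical stand-in for visual input: in the paper this would be
    # image features; here we expose the quantities directly.
    steps_to_intersection: int
    block_length: int


def locate(obs: Observation) -> float:
    """Toy stand-in for the BAL module (assumption, not the learned model):
    return the normalized distance to the next intersection, in [0, 1]."""
    if obs.block_length <= 0:
        return 0.0
    ratio = obs.steps_to_intersection / obs.block_length
    return max(0.0, min(1.0, ratio))


def plan(sub_instructions: List[str], pointer: int, distance: float) -> Tuple[str, int]:
    """Toy stand-in for the SAP module (assumption): choose an action using
    the estimated position and the sub-instruction currently in focus."""
    if pointer >= len(sub_instructions):
        return "stop", pointer
    if distance > 0.0:
        # Still inside the block: keep moving, instruction pointer unchanged.
        return "forward", pointer
    # At the intersection: execute the current sub-instruction and advance.
    return sub_instructions[pointer], pointer + 1


def navigate(observations: List[Observation], sub_instructions: List[str]) -> List[str]:
    """Run locating before planning at every step."""
    pointer = 0
    actions: List[str] = []
    for obs in observations:
        distance = locate(obs)  # step 1: locate
        action, pointer = plan(sub_instructions, pointer, distance)  # step 2: plan
        actions.append(action)
        if action == "stop":
            break
    return actions


if __name__ == "__main__":
    route = [
        Observation(2, 2),
        Observation(1, 2),
        Observation(0, 2),
        Observation(1, 1),
        Observation(0, 1),
    ]
    print(navigate(route, ["left", "stop"]))
```

In this toy version the position estimate is what keeps the instruction pointer from advancing too early, which mirrors the abstract's point that knowing where the agent is helps track which part of the instruction has been completed.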


