Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation
Huilin Tian, Jingke Meng, Wei-Shi Zheng, Yuan-Ming Li, Junkai Yan, Yunong Zhang
arXiv - CS - Multimedia, 2024-08-09
DOI: arxiv-2408.05090 (https://doi.org/arxiv-2408.05090)
Abstract
Vision and Language Navigation (VLN) is a challenging task that requires
agents to understand instructions and navigate to the destination in a visual
environment. One of the key challenges in outdoor VLN is keeping track of which
part of the instruction has been completed. To alleviate this problem, previous
works mainly focus on grounding the natural language to the visual input, but
neglect the crucial role of the agent's spatial position information in the
grounding process. In this work, we first explore the substantial effect of
spatial position locating on the grounding of outdoor VLN, drawing inspiration
from human navigation. In real-world navigation scenarios, before planning a
path to the destination, humans typically need to figure out their current
location. This observation underscores the pivotal role of spatial localization
in the navigation process. We introduce a novel framework, Locating before
Planning (Loc4Plan), designed to incorporate spatial perception into action
planning for outdoor VLN tasks. The main idea behind Loc4Plan is to perform
spatial localization before planning an action based on the corresponding
guidance. The framework comprises a block-aware spatial locating (BAL) module
and a spatial-aware action planning (SAP) module. Specifically, to help the
agent perceive its spatial location in the environment, the BAL module learns
a position predictor that estimates how far the agent is from the next
intersection. After the locating process, the SAP module incorporates this
spatial information to ground the corresponding guidance and enhance the
precision of action planning. Extensive experiments on the Touchdown and
map2seq datasets show that the proposed Loc4Plan outperforms state-of-the-art
(SOTA) methods.
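
The locate-then-plan idea can be illustrated with a toy sketch. It is not the authors' implementation: the abstract describes a learned position predictor and a learned planner, whereas here the predictor is replaced by an oracle over a hypothetical boolean route, and every name (`Step`, `steps_to_next_intersection`, `plan_action`, `run_episode`) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One sub-instruction, e.g. 'turn left at the next intersection' (hypothetical)."""
    action_at_intersection: str  # "left", "right", or "stop"


def steps_to_next_intersection(route_is_intersection: List[bool], pos: int) -> int:
    """Oracle stand-in for a learned position predictor (an assumption).

    Returns the number of steps from `pos` to the next intersection
    (0 if the agent is standing on one), or -1 if none lies ahead.
    """
    for offset, is_intersection in enumerate(route_is_intersection[pos:]):
        if is_intersection:
            return offset
    return -1


def plan_action(distance: int, instruction: List[Step], pointer: int) -> Tuple[str, int]:
    """Plan after locating: consume a sub-instruction only at an intersection."""
    if distance == 0 and pointer < len(instruction):
        return instruction[pointer].action_at_intersection, pointer + 1
    return "forward", pointer


def run_episode(route_is_intersection: List[bool], instruction: List[Step]) -> List[str]:
    """Walk the route; at every position locate first, then plan."""
    actions: List[str] = []
    pointer = 0  # index of the next sub-instruction still to be completed
    for pos in range(len(route_is_intersection)):
        distance = steps_to_next_intersection(route_is_intersection, pos)
        action, pointer = plan_action(distance, instruction, pointer)
        actions.append(action)
        if action == "stop":
            break
    return actions


if __name__ == "__main__":
    route = [False, False, True, False, True]
    print(run_episode(route, [Step("left"), Step("stop")]))
```

In this sketch the distance to the next intersection is what lets the agent know which part of the instruction has been completed: the instruction pointer advances only when the located distance is zero.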