PASTS: Progress-aware spatio-temporal transformer speaker for vision-and-language navigation

Cited by: 2
Authors
Wang, Liuyi [1 ]
Liu, Chengju [1 ,2 ]
He, Zongtao [1 ]
Li, Shu [1 ]
Yan, Qingqing [1 ]
Chen, Huiyi [3 ]
Chen, Qijun [1 ]
Affiliations
[1] Tongji Univ, Dept Control Sci & Engn, Shanghai, Peoples R China
[2] Tongji Artificial Intelligence Suzhou Res Inst, Suzhou, Peoples R China
[3] Rutgers State Univ, New Brunswick, NJ, USA
Funding
National Natural Science Foundation of China;
Keywords
Vision-and-language navigation; Natural language generation; Spatio-temporal transformer; Trajectory description;
DOI
10.1016/j.engappai.2023.107487
CLC Classification
TP [Automation technology, computer technology];
Discipline Code
0812;
Abstract
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique for enhancing generalization in VLN is the use of an independent speaker model that provides pseudo-instructions for data augmentation. However, current speaker models based on Long Short-Term Memory (LSTM) lack the ability to attend to features that are relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS employs a spatio-temporal encoder to fuse panoramic representations and encode intermediate connections across steps. In addition, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed to enable the model to estimate the progress of instruction generation and produce more fine-grained captions. Furthermore, a multifeature dropout (MFD) strategy is introduced to alleviate overfitting. The proposed PASTS can be flexibly combined with existing VLN models. Experimental results demonstrate that PASTS outperforms previous speaker models and improves the performance of existing VLN models, achieving state-of-the-art results on the standard Room-to-Room (R2R) dataset.
Pages: 12
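The abstract describes the spatio-temporal encoder only at a high level. The following is a minimal sketch, assuming a PyTorch-style implementation, of how such an encoder could fuse the panoramic view features within each navigation step (spatial attention) and then encode connections across steps (temporal attention). The class name, dimensions, layer counts, and the mean-pooling choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a spatio-temporal encoder:
# spatial attention fuses the panoramic views at each step, and temporal
# attention encodes connections across steps of the trajectory.
import torch
import torch.nn as nn


class SpatioTemporalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        spatial_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        # Spatial encoder attends over the panoramic views within one step.
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers)
        # Temporal encoder attends over the fused step representations.
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers)

    def forward(self, pano_feats):
        # pano_feats: (batch, steps, views, feat_dim), e.g. 36 panoramic views per step.
        b, t, v, _ = pano_feats.shape
        x = self.proj(pano_feats)                    # (b, t, v, d_model)
        x = x.view(b * t, v, -1)
        x = self.spatial_encoder(x)                  # fuse views within each step
        step_repr = x.mean(dim=1).view(b, t, -1)     # pool views into step tokens
        return self.temporal_encoder(step_repr)      # encode connections across steps


if __name__ == "__main__":
    feats = torch.randn(2, 7, 36, 2048)              # 2 trajectories, 7 steps, 36 views
    enc = SpatioTemporalEncoder()
    print(enc(feats).shape)                          # torch.Size([2, 7, 512])
```

The resulting step-level representations would then condition an instruction decoder; the progress monitor and multifeature dropout described in the abstract are omitted from this sketch.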