PASTS: Progress-aware spatio-temporal transformer speaker for vision-and-language navigation

Cited by: 2
Authors
Wang, Liuyi [1 ]
Liu, Chengju [1 ,2 ]
He, Zongtao [1 ]
Li, Shu [1 ]
Yan, Qingqing [1 ]
Chen, Huiyi [3 ]
Chen, Qijun [1 ]
Affiliations
[1] Tongji Univ, Dept Control Sci & Engn, Shanghai, Peoples R China
[2] Tongji Artificial Intelligence Suzhou Res Inst, Suzhou, Peoples R China
[3] Rutgers State Univ, New Brunswick, NJ USA
Funding
National Natural Science Foundation of China
Keywords
Vision-and-language navigation; Natural language generation; Spatio-temporal transformer; Trajectory description;
DOI
10.1016/j.engappai.2023.107487
CLC classification
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique for enhancing generalization in VLN is to use an independent speaker model that provides pseudo-instructions for data augmentation. However, current speaker models based on Long Short-Term Memory (LSTM) networks cannot attend to features that are relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS employs a spatio-temporal encoder to fuse panoramic representations and encode intermediate connections across steps. In addition, to avoid the misalignment problem that could lead to incorrect supervision, a speaker progress monitor (SPM) is proposed, enabling the model to estimate the progress of instruction generation and produce finer-grained captions. A multifeature dropout (MFD) strategy is also introduced to alleviate overfitting. The proposed PASTS can be flexibly combined with existing VLN models. Experimental results demonstrate that PASTS outperforms previous speaker models and improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.
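The multifeature dropout (MFD) idea described in the abstract, randomly suppressing whole input feature streams during training to reduce overfitting, can be illustrated with a minimal sketch. The function name, the modality keys, and the inverted-dropout rescaling below are assumptions for illustration, not the authors' actual implementation:

```python
import random

def multifeature_dropout(features, p=0.3, training=True, rng=random):
    """Hypothetical MFD sketch: with probability p, zero out an entire
    feature stream (e.g. visual, angle, or progress embeddings); rescale
    surviving streams by 1/(1-p) so expected magnitudes are preserved."""
    if not training or p <= 0:
        return dict(features)  # identity at inference time
    keep = 1.0 - p
    out = {}
    for name, vec in features.items():
        if rng.random() < p:
            out[name] = [0.0] * len(vec)          # drop this modality entirely
        else:
            out[name] = [v / keep for v in vec]   # inverted-dropout rescale
    return out

# Example usage with hypothetical modality names:
feats = {"visual": [1.0, 2.0], "angle": [3.0]}
augmented = multifeature_dropout(feats, p=0.3, rng=random.Random(0))
```

Dropping a whole modality (rather than individual units) forces the speaker to generate plausible instructions even when one input stream is missing, which is one plausible reading of why such a strategy would curb overfitting.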
Pages: 12