Episodic Transformer for Vision-and-Language Navigation

Cited by: 51
Authors
Pashevich, Alexander [1 ,2 ]
Schmid, Cordelia [2 ]
Sun, Chen [2 ,3 ]
Affiliations
[1] INRIA, Le Chesnay, France
[2] Google Res, Mountain View, CA 94043 USA
[3] Brown Univ, Providence, RI 02912 USA
DOI
10.1109/ICCV48922.2021.01564
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
Pages: 15922-15932
Page count: 11
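
The abstract describes E.T. as a single multimodal transformer that jointly encodes the language instruction together with the full episode history of visual observations and past actions, then predicts the next action. The following is a minimal PyTorch sketch of that idea; the class name, feature dimensions, and next-action head are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Toy episodic transformer: language + observation/action history in one sequence.
    All sizes below are assumptions made for this sketch."""
    def __init__(self, vocab_size=1000, num_actions=12, d_model=256,
                 nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)     # instruction tokens
        self.action_emb = nn.Embedding(num_actions, d_model)  # past actions
        self.visual_proj = nn.Linear(512, d_model)            # pre-extracted frame features
        self.pos_emb = nn.Embedding(max_len, d_model)         # positions over the joint sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_actions)    # next-action logits

    def forward(self, instr_tokens, frame_feats, past_actions):
        # instr_tokens: (B, L) token ids; frame_feats: (B, T, 512), one feature
        # vector per observed frame; past_actions: (B, T) action ids.
        lang = self.word_emb(instr_tokens)
        vis = self.visual_proj(frame_feats)
        act = self.action_emb(past_actions)
        # Concatenate language and the full episode history into one sequence.
        seq = torch.cat([lang, vis, act], dim=1)
        pos = torch.arange(seq.size(1), device=seq.device)
        hidden = self.encoder(seq + self.pos_emb(pos))
        # Predict the next action from the encoding of the most recent frame.
        last_frame = hidden[:, lang.size(1) + vis.size(1) - 1]
        return self.action_head(last_frame)

model = EpisodicTransformerSketch()
logits = model(torch.randint(0, 1000, (2, 8)),  # 8 instruction tokens
               torch.randn(2, 5, 512),          # 5 observed frames
               torch.randint(0, 12, (2, 5)))    # 5 past actions
print(logits.shape)  # torch.Size([2, 12])

During training one would additionally mask attention so that positions for step t cannot attend to observations and actions from later steps; that masking and the paper's synthetic-instruction pretraining are omitted here for brevity.
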
Related Papers
50 items in total
  • [41] Action Inference for Destination Prediction in Vision-and-Language Navigation
    Kondapally, Anirudh Reddy
    Yamada, Kentaro
    Yanaka, Hitomi
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024: 210-217
  • [42] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
    Zhou, Gengze
    Hong, Yicong
    Wu, Qi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024: 7641-7649
  • [43] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
    Kim, Wonjae
    Son, Bokyung
    Kim, Ildoo
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [44] Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
    Hu, Ronghang
    Fried, Daniel
    Rohrbach, Anna
    Klein, Dan
    Darrell, Trevor
    Saenko, Kate
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019: 6551-6557
  • [45] Cluster-based Curriculum Learning for Vision-and-Language Navigation
    Wang, Ting
    Wu, Zongkai
    Liu, Zihan
    Wang, Donglin
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [46] Vision-and-Language Navigation via Latent Semantic Alignment Learning
    Wu, Siying
    Fu, Xueyang
    Wu, Feng
    Zha, Zheng-Jun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 8406-8418
  • [47] Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation
    Hwang, Jisu
    Kim, Incheol
    SENSORS, 2021, 21 (03): 1-23
  • [48] FedVLN: Privacy-Preserving Federated Vision-and-Language Navigation
    Zhou, Kaiwen
    Wang, Xin Eric
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 682-699
  • [49] Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness
    Gao, Fang
    Tang, Jingfeng
    Wang, Jiabao
    Li, Shaodong
    Yu, Jun
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (12): 10874-10881
  • [50] Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation
    Yu, Felix
    Deng, Zhiwei
    Narasimhan, Karthik
    Russakovsky, Olga
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020: 4000-4004