Episodic Transformer for Vision-and-Language Navigation

被引:51
|
作者
Pashevich, Alexander [1 ,2 ]
Schmid, Cordelia [2 ]
Sun, Chen [2 ,3 ]
机构
[1] INRIA, Le Chesnay, France
[2] Google Res, Mountain View, CA 94043 USA
[3] Brown Univ, Providence, RI 02912 USA
关键词
D O I
10.1109/ICCV48922.2021.01564
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequence of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
引用
收藏
页码:15922 / 15932
页数:11
相关论文
共 50 条
  • [21] Diagnosing the Environment Bias in Vision-and-Language Navigation
    Zhang, Yubo
    Tan, Hao
    Bansal, Mohit
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 890 - 897
  • [22] Topological Planning with Transformers for Vision-and-Language Navigation
    Chen, Kevin
    Chen, Junshen K.
    Chuang, Jo
    Vazquez, Marynel
    Savarese, Silvio
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11271 - 11281
  • [23] Scaling Data Generation in Vision-and-Language Navigation
    Wang, Zun
    Li, Jialu
    Hong, Yicong
    Wang, Yi
    Wu, Qi
    Bansal, Mohit
    Gould, Stephen
    Tan, Hao
    Qiao, Yu
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11975 - 11986
  • [24] AerialVLN (sic) : Vision-and-Language Navigation for UAVs
    Liu, Shubo
    Zhang, Hongsheng
    Qi, Yuankai
    Wang, Peng
    Zhang, Yanning
    Wu, Qi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15338 - 15348
  • [25] Vision-and-Language Navigation via Causal Learning
    Wang, Liuyi
    He, Zongtao
    Dang, Ronghao
    Shen, Mengjiao
    Liu, Chengju
    Chen, Qijun
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13139 - 13150
  • [26] CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation
    Magassouba, Aly
    Sugiura, Komei
    Kawai, Hisashi
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2021, 6 (04) : 6258 - 6265
  • [27] PASTS: Progress-aware spatio-temporal transformer speaker for vision-and-language navigation
    Wang, Liuyi
    Liu, Chengju
    He, Zongtao
    Li, Shu
    Yan, Qingqing
    Chen, Huiyi
    Chen, Qijun
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 128
  • [28] Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
    Chen, Shizhe
    Guhur, Pierre-Louis
    Tapaswi, Makarand
    Schmid, Cordelia
    Laptev, Ivan
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16516 - 16526
  • [29] Reinforced Vision-and-Language Navigation Based on Historical BERT
    Zhang, Zixuan
    Qi, Shuhan
    Zhou, Zihao
    Zhang, Jiajia
    Yuan, Hao
    Wang, Xuan
    Wang, Lei
    Xiao, Jing
    ADVANCES IN SWARM INTELLIGENCE, ICSI 2023, PT II, 2023, 13969 : 427 - 438
  • [30] Diagnosing Vision-and-Language Navigation: What Really Matters
    Zhu, Wanrong
    Qi, Yuankai
    Narayana, Pradyumna
    Sone, Kazoo
    Basu, Sugato
    Wang, Eric Xin
    Wu, Qi
    Eckstein, Miguel
    Wang, William Yang
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5981 - 5993