Interpreting Natural Language Instructions Using Language, Vision, and Behavior

Cited by: 3
Authors
Benotti, Luciana [1 ,2 ]
Lau, Tessa [3 ]
Villalba, Martin [1 ,4 ]
Affiliations
[1] Univ Nacl Cordoba, Cordoba, Argentina
[2] Consejo Nacl Invest Cient & Tecn, Buenos Aires, DF, Argentina
[3] Savioke Inc, Sunnyvale, CA USA
[4] Univ Potsdam, D-14476 Potsdam, Germany
Keywords
Design; Algorithms; Performance; Natural language interpretation; multimodal understanding; action recognition; visual feedback; situated virtual agent; unsupervised learning
DOI
10.1145/2629632
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We define the problem of automatic instruction interpretation as follows. Given a natural language instruction, can we automatically predict what an instruction follower, such as a robot, should do in the environment to follow that instruction? Previous approaches to automatic instruction interpretation have required either extensive domain-dependent rule writing or extensive manually annotated corpora. This article presents a novel approach that leverages a large amount of unannotated, easy-to-collect data from humans interacting in a game-like environment. Our approach uses an automatic annotation phase based on artificial intelligence planning, for which two different annotation strategies are compared: one based on behavioral information and the other based on visibility information. The resulting annotations are used as training data for different automatic classifiers. This algorithm is based on the intuition that the problem of interpreting a situated instruction can be cast as a classification problem of choosing among the actions that are possible in the situation. Classification is done by combining language, vision, and behavior information. Our empirical analysis shows that machine learning classifiers achieve 77% accuracy on this task on available English corpora and 74% on similar German corpora. Finally, the inclusion of human feedback in the interpretation process is shown to boost performance to 92% for the English corpus and 90% for the German corpus.
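The abstract's core idea can be sketched in code: interpretation is classification over the actions possible in the current situation, scoring each candidate by combining language, vision, and behavior evidence. The data structures, weights, and toy data below are illustrative assumptions for this record, not the authors' implementation.

```python
# Minimal sketch: pick the best action among those possible in the situation
# by combining three evidence channels (language, vision, behavior).
from dataclasses import dataclass, field


@dataclass
class CandidateAction:
    name: str                                   # action identifier, e.g. "press(red_button)"
    keywords: set = field(default_factory=set)  # language channel: words tied to this action
    visible: bool = False                       # vision channel: is the target visible now?
    behavior_freq: float = 0.0                  # behavior channel: how often followers chose it here


def score(instruction: str, action: CandidateAction,
          w_lang: float = 1.0, w_vis: float = 0.5, w_beh: float = 0.5) -> float:
    """Combine the three evidence channels into one score (weights are assumptions)."""
    tokens = set(instruction.lower().split())
    lang = len(tokens & action.keywords) / max(len(action.keywords), 1)
    return w_lang * lang + w_vis * float(action.visible) + w_beh * action.behavior_freq


def interpret(instruction: str, possible_actions: list) -> str:
    """Classify: choose the highest-scoring candidate action."""
    return max(possible_actions, key=lambda a: score(instruction, a)).name


actions = [
    CandidateAction("press(red_button)", {"press", "red", "button"},
                    visible=True, behavior_freq=0.6),
    CandidateAction("open(door)", {"open", "door"},
                    visible=False, behavior_freq=0.2),
]
print(interpret("press the red button", actions))  # → press(red_button)
```

In the paper's setting the training labels for such a classifier come from the automatic planning-based annotation phase rather than manual labeling, and human feedback can rerank or veto the chosen action.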
Pages: 22
Related Papers
50 items in total
  • [21] Connecting Language and Vision for Natural Language-Based Vehicle Retrieval
    Bai, Shuai
    Zheng, Zhedong
    Wang, Xiaohan
    Lin, Junyang
    Zhang, Zhu
    Zhou, Chang
    Yang, Hongxia
    Yang, Yi
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 4029 - 4038
  • [22] Interpreting vision and language generative models with semantic visual priors
    Cafagna, Michele
    Rojas-Barahona, Lina M.
    van Deemter, Kees
    Gatt, Albert
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2023, 6
  • [23] NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
    Sammani, Fawaz
    Mukherjee, Tanmoy
    Deligiannis, Nikos
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 8312 - 8322
  • [24] Natural behavior is the language of the brain
    Miller, Cory T.
    Gire, David
    Hoke, Kim
    Huk, Alexander C.
    Kelley, Darcy
    Leopold, David A.
    Smear, Matthew C.
    Theunissen, Frederic
    Yartsev, Michael
    Niell, Cristopher M.
    CURRENT BIOLOGY, 2022, 32 (10) : R482 - R493
  • [25] Interpreting natural language descriptions of the topological relations of enclaves
    Wang, Xiaonan
    Zhang, Xiuyuan
    JOURNAL OF GEOGRAPHICAL SYSTEMS, 2025, : 301 - 335
  • [27] Detecting Target Objects by Natural Language Instructions Using an RGB-D Camera
    Bao, Jiatong
    Jia, Yunyi
    Cheng, Yu
    Tang, Hongru
    Xi, Ning
    SENSORS, 2016, 16 (12):
  • [28] Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
    Blukis, Valts
    Terme, Yannick
    Niklasson, Eyvind
    Knepper, Ross A.
    Artzi, Yoav
    CONFERENCE ON ROBOT LEARNING, VOL 100, 2019, 100
  • [29] Natural language texts for a cognitive vision system
    Arens, M
    Ottlik, A
    Nagel, HH
    ECAI 2002: 15TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2002, 77 : 455 - 459
  • [30] Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
    Iki, Taichi
    Aizawa, Akiko
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 2189 - 2196