GPT-4V(ision) for Robotics: Multimodal Task Planning From Human Demonstration

Cited by: 2
Authors
Wake, Naoki [1 ]
Kanehira, Atsushi [1 ]
Sasabuchi, Kazuhiro [1 ]
Takamatsu, Jun [1 ]
Ikeuchi, Katsushi [1 ]
Affiliations
[1] Microsoft, Appl Robot Res, Redmond, WA 98052 USA
Source
IEEE ROBOTICS AND AUTOMATION LETTERS, 2024
Keywords
Robots; Affordances; Pipelines; Planning; Collision avoidance; Visualization; Machine vision; Grounding; Data models; Training; Task and motion planning; task planning; imitation learning
DOI
10.1109/LRA.2024.3477090
Chinese Library Classification (CLC)
TP24 [Robotics]
Discipline Classification Codes
080202; 1405
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. The system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems ground the task plan spatially and temporally in the videos: objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint the moments of grasping and releasing. This spatiotemporal grounding allows affordance information critical for robot execution (e.g., grasp types, waypoints, and body postures) to be gathered. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline.
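The three-stage flow the abstract describes (VLM video analysis → symbolic task planning → spatiotemporal grounding with affordance extraction) can be sketched as a minimal skeleton. All function names, data fields, and hard-coded return values below are illustrative placeholders standing in for the GPT-4V, GPT-4, and vision-system calls; they are not the authors' actual API.

```python
from dataclasses import dataclass, field

@dataclass
class TaskStep:
    """One symbolic step of the plan, later enriched with affordance data."""
    action: str                 # e.g. "grasp" or "release"
    target: str                 # object name from open-vocabulary detection
    affordance: dict = field(default_factory=dict)  # grasp type, waypoints, posture

def analyze_video(video_path: str) -> str:
    # Stage 1 (stand-in): GPT-4V would turn the demonstration video into
    # a textual explanation of environmental and action details.
    return "The person grasps the cup and places it on the shelf."

def plan_tasks(description: str) -> list:
    # Stage 2 (stand-in): a GPT-4-based planner would encode the
    # description into a symbolic task plan.
    return [TaskStep("grasp", "cup"), TaskStep("release", "cup")]

def ground(steps: list, video_path: str) -> list:
    # Stage 3 (stand-in): vision systems would ground each step in the
    # video, pinpointing grasp/release moments and attaching affordance
    # information needed for robot execution.
    for step in steps:
        step.affordance = {"grasp_type": "power", "waypoint": (0.40, 0.10, 0.30)}
    return steps

plan = ground(plan_tasks(analyze_video("demo.mp4")), "demo.mp4")
```

The staged design mirrors the paper's point about supervision: because each stage emits an inspectable intermediate (text, symbolic plan, grounded plan), a human can catch VLM hallucinations before anything reaches the robot.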
Pages: 10567-10574 (8 pages)
Related Papers
50 total (10 shown)
  • [1] Map Reading and Analysis with GPT-4V(ision)
    Xu, Jinwen; Tao, Ran
    ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2024, 13 (04)
  • [2] Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis
    Xu, Pusheng; Chen, Xiaolan; Zhao, Ziwei; Shi, Danli
    BRITISH JOURNAL OF OPHTHALMOLOGY, 2024, 108 (10): 1384-1389
  • [3] Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study
    Nakao, Takahiro; Miki, Soichiro; Nakamura, Yuta; Kikuchi, Tomohiro; Nomura, Yukihiro; Hanaoka, Shouhei; Yoshikawa, Takeharu; Abe, Osamu
    JMIR MEDICAL EDUCATION, 2024, 10
  • [4] Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
    Lee, Gyeonggeon; Zhai, Xiaoming
    TECHTRENDS, 2025: 271-287
  • [5] Evaluation of Multimodal ChatGPT (GPT-4V) in Describing Mammography Image Features
    Haver, Hana; Bahl, Manisha; Doo, Florence; Kamel, Peter; Parekh, Vishwa; Jeudy, Jean; Yi, Paul
    CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL-JOURNAL DE L ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2024, 75 (04): 947-949
  • [6] GPT-4V passes the BLS and ACLS examinations: An analysis of GPT-4V's image recognition capabilities
    King, Ryan C.; Bharani, Vishnu; Shah, Kunal; Yeo, Yee Hui; Samaan, Jamil S.
    RESUSCITATION, 2024, 195
  • [7] Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases
    Schramm, Severin; Preis, Silas; Metz, Marie-Christin; Jung, Kirsten; Schmitz-Koep, Benita; Zimmer, Claus; Wiestler, Benedikt; Hedderich, Dennis M.; Kim, Su Hwan
    RADIOLOGY, 2025, 314 (01)
  • [8] Evaluating GPT-4V (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs
    Zhou, Yiliang; Ong, Hanley; Kennedy, Patrick; Wu, Carol C.; Kazam, Jacob; Hentel, Keith; Flanders, Adam; Shih, George; Peng, Yifan
    RADIOLOGY, 2024, 311 (02)
  • [9] Can Large Language Models Automatically Jailbreak GPT-4V?
    Wu, Yuanwei; Huang, Yue; Liu, Yixin; Li, Xiang; Zhou, Pan; Sun, Lichao
    arXiv
  • [10] How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
    Chen, Zhe; Wang, Weiyun; Tian, Hao; Ye, Shenglong; Gao, Zhangwei; Cui, Erfei; Tong, Wenwen; Hu, Kongzhi; Luo, Jiapeng; Ma, Zheng; Ma, Ji; Wang, Jiaqi; Dong, Xiaoyi; Yan, Hang; Guo, Hewei; He, Conghui; Shi, Botian; Jin, Zhenjiang; Xu, Chao; Wang, Bin; Wei, Xingjian; Li, Wei; Zhang, Wenjian; Zhang, Bo; Cai, Pinlong; Wen, Licheng; Yan, Xiangchao; Dou, Min; Lu, Lewei; Zhu, Xizhou; Lu, Tong; Lin, Dahua; Qiao, Yu; Dai, Jifeng; Wang, Wenhai
    SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (12)