GPT-4V(ision) for Robotics: Multimodal Task Planning From Human Demonstration

Cited by: 2
|
Authors
Wake, Naoki [1 ]
Kanehira, Atsushi [1 ]
Sasabuchi, Kazuhiro [1 ]
Takamatsu, Jun [1 ]
Ikeuchi, Katsushi [1 ]
Affiliations
[1] Microsoft, Appl Robot Res, Redmond, WA 98052 USA
Source
Keywords
Robots; Affordances; Pipelines; Planning; Collision avoidance; Visualization; Machine vision; Grounding; Data models; Training; Task and motion planning; task planning; imitation learning;
DOI
10.1109/LRA.2024.3477090
Chinese Library Classification
TP24 [Robotics];
Discipline Codes
080202 ; 1405 ;
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos: objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline.
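The three stages described in the abstract (video-to-text explanation, symbolic task planning, and spatiotemporal grounding with affordance extraction) can be sketched as the following minimal Python pipeline. This is an illustrative reconstruction only, not the authors' implementation: all function names and data fields are hypothetical, and the GPT-4V, GPT-4, and vision-model calls are stubbed with canned outputs to show the data flow between stages.

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    """A single human demonstration video (one-shot teaching input)."""
    video_path: str

def describe_video(demo: Demonstration) -> str:
    # Stage 1 (stubbed): GPT-4V would analyze the video and return a textual
    # explanation of environmental and action details.
    return "The hand grasps the can on the shelf and places it on the table."

def plan_tasks(description: str) -> list[str]:
    # Stage 2 (stubbed): a GPT-4-based planner would encode the description
    # into a symbolic task plan (hypothetical action vocabulary shown here).
    return ["approach(can)", "grasp(can)", "move(can, table)", "release(can)"]

def ground_plan(plan: list[str], demo: Demonstration) -> list[dict]:
    # Stage 3 (stubbed): an open-vocabulary detector and hand-object
    # interaction analysis would attach affordance data (grasp type,
    # waypoints, timing of grasp/release) to each symbolic step.
    return [
        {"step": s, "grasp_type": "power" if s.startswith("grasp") else None}
        for s in plan
    ]

demo = Demonstration("human_demo.mp4")
grounded = ground_plan(plan_tasks(describe_video(demo)), demo)
for step in grounded:
    print(step["step"], step["grasp_type"])
```

The key design point conveyed by the abstract is that each stage's output is human-readable (text, then a symbolic plan), which is what makes the human supervision the authors recommend against hallucination practical to insert between stages.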
Pages: 10567-10574
Page count: 8
Related Papers
50 records in total
  • [21] Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing
    Hwang, Hochul
    Kwon, Sunjae
    Kim, Yekyung
    Kim, Donghyun
    2024 21ST INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS, UR 2024, 2024, : 281 - 288
  • [22] Evaluating the image recognition capabilities of GPT-4V and Gemini Pro in the Japanese national dental examination
    Fukuda, Hikaru
    Morishita, Masaki
    Muraoka, Kosuke
    Yamaguchi, Shino
    Nakamura, Taiji
    Yoshioka, Izumi
    Awano, Shuji
    Ono, Kentaro
    JOURNAL OF DENTAL SCIENCES, 2025, 20 (01) : 368 - 372
  • [23] Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study
    Yang, Zhichao
    Yao, Zonghai
    Tasmin, Mahbuba
    Vashisht, Parth
    Jang, Won Seok
    Ouyang, Feiyun
    Wang, Beining
    Mcmanus, David
    Berlowitz, Dan
    Yu, Hong
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2025, 27
  • [24] Integrating Text and Image Analysis: Exploring GPT-4V's Capabilities in Advanced Radiological Applications Across Subspecialties
    Busch, Felix
    Han, Tianyu
    Makowski, Marcus R.
    Truhn, Daniel
    Bressem, Keno K.
    Adams, Lisa
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [25] Idea2Img: Iterative Self-refinement with GPT-4V for Automatic Image Design and Generation
    Yang, Zhengyuan
    Wang, Jianfeng
    Li, Linjie
    Lin, Kevin
    Lin, Chung-Ching
    Liu, Zicheng
    Wang, Lijuan
    COMPUTER VISION-ECCV 2024, PT XXXVIII, 2025, 15096 : 167 - 184
  • [26] Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases
    Suh, Pae Sun
    Shim, Woo Hyun
    Suh, Chong Hyun
    Heo, Hwon
    Park, Chae Ri
    Eom, Hye Joung
    Park, Kye Jin
    Choe, Jooae
    Kim, Pyeong Hwa
    Park, Hyo Jung
    Ahn, Yura
    Park, Ho Young
    Choi, Yoonseok
    Woo, Chang-Yun
    Park, Hyungjun
    RADIOLOGY, 2024, 312 (01) : e240273
  • [27] A decision-making model for self-driving vehicles based on GPT-4V, federated reinforcement learning, and blockchain
    Alam, Tanweer
    Gupta, Ruchi
    Ahamed, N. Nasurudeen
    Ullah, Arif
NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (34) : 21545 - 21560
  • [28] Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)'s ability to interpret radiological images
    Zhu, Lingxuan
    Mou, Weiming
    Lai, Yancheng
    Chen, Jinghong
    Lin, Shujia
    Xu, Liling
    Lin, Junda
    Guo, Zeji
    Yang, Tao
    Lin, Anqi
    Qi, Chang
    Gan, Ling
    Zhang, Jian
    Luo, Peng
    INTERNATIONAL JOURNAL OF SURGERY, 2024, 110 (07) : 4096 - 4102
  • [29] Robot Learning from Demonstration: A Task-level Planning Approach
    Ekvall, Staffan
    Kragic, Danica
    INTERNATIONAL JOURNAL OF ADVANCED ROBOTIC SYSTEMS, 2008, 5 (03) : 223 - 234
  • [30] Grasp Pose Learning from Human Demonstration with Task Constraints
    Liu, Yinghui
    Qian, Kun
    Xu, Xin
    Zhou, Bo
    Fang, Fang
    Journal of Intelligent & Robotic Systems, 2022, 105