GPT-4V(ision) for Robotics: Multimodal Task Planning From Human Demonstration

Cited by: 2
Authors
Wake, Naoki [1]
Kanehira, Atsushi [1]
Sasabuchi, Kazuhiro [1]
Takamatsu, Jun [1]
Ikeuchi, Katsushi [1]
Affiliations
[1] Microsoft, Appl Robot Res, Redmond, WA 98052 USA
Keywords
Robots; Affordances; Pipelines; Planning; Collision avoidance; Visualization; Machine vision; Grounding; Data models; Training; Task and motion planning; task planning; imitation learning;
DOI
10.1109/LRA.2024.3477090
Chinese Library Classification (CLC)
TP24 [Robotics]
Subject Classification Codes
080202; 1405
Abstract
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos: objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline.
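To make the three-stage structure described in the abstract concrete, the following Python sketch mocks the pipeline: video-to-text description, symbolic task planning, and spatiotemporal grounding with affordance extraction. Every function name, data field, and placeholder return value here is a hypothetical illustration, not the authors' implementation or any specific model API.

    from dataclasses import dataclass, field

    @dataclass
    class GroundedStep:
        """One symbolic action, grounded in the demonstration video."""
        action: str                                      # e.g. "grasp", "move", "release"
        target_object: str                               # open-vocabulary object label
        grasp_frame: int | None = None                   # frame where grasping is detected
        affordance: dict = field(default_factory=dict)   # grasp type, waypoints, posture

    def describe_video(video_frames):
        # Stage 1 (hypothetical stand-in): GPT-4V would summarize the
        # environment and the demonstrated actions as text.
        return "The person picks up the cup on the table and places it on the tray."

    def plan_tasks(description):
        # Stage 2 (hypothetical stand-in): a GPT-4-based planner would encode
        # the textual description into a symbolic task plan.
        return [("grasp", "cup"), ("move", "cup"), ("release", "cup")]

    def ground_plan(plan, video_frames):
        # Stage 3 (hypothetical stand-in): an open-vocabulary detector would
        # localize each named object, and hand-object interaction analysis
        # would pinpoint grasp/release moments and collect affordance data.
        steps = []
        for i, (action, obj) in enumerate(plan):
            steps.append(GroundedStep(
                action=action,
                target_object=obj,
                grasp_frame=10 * i if action == "grasp" else None,
                affordance={"grasp_type": "power"} if action == "grasp" else {},
            ))
        return steps

    def teach_from_demonstration(video_frames):
        # One-shot visual teaching: a demonstration video in, a grounded
        # robot program out. Intermediate outputs are kept explicit so a
        # human can inspect them before execution.
        description = describe_video(video_frames)
        plan = plan_tasks(description)
        return ground_plan(plan, video_frames)

    if __name__ == "__main__":
        for step in teach_from_demonstration(video_frames=[]):
            print(step)

Keeping each stage as a separate function mirrors the abstract's point about human supervision: because the vision-language stage can hallucinate, a reviewer can correct the textual description or the symbolic plan before anything reaches the robot.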
Pages: 10567-10574
Number of pages: 8
Related Papers (50 in total)
  • [31] Grasp Pose Learning from Human Demonstration with Task Constraints
    Liu, Yinghui
    Qian, Kun
    Xu, Xin
    Zhou, Bo
    Fang, Fang
    JOURNAL OF INTELLIGENT & ROBOTIC SYSTEMS, 2022, 105 (02)
  • [32] Joining Force of Human Muscular Task Planning With Robot Robust and Delicate Manipulation for Programming by Demonstration
    Wang, Fei
    Zhou, Xingqun
    Wang, Jianhui
    Zhang, Xing
    He, Zhenquan
    Song, Bo
    IEEE-ASME TRANSACTIONS ON MECHATRONICS, 2020, 25 (05) : 2574 - 2584
  • [33] Demonstration of the EMPATHIC Framework for Task Learning from Implicit Human Feedback
    Cui, Yuchen
    Zhang, Qiping
    Jain, Sahil
    Allievi, Alessandro
    Stone, Peter
    Niekum, Scott
    Knox, W. Bradley
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 16017 - 16019
  • [34] Robot Learning from Human Demonstration of Peg-in-Hole Task
    Wang, Peng
    Zhu, Jianxin
    Feng, Wei
    Ou, Yongsheng
    2018 IEEE 8TH ANNUAL INTERNATIONAL CONFERENCE ON CYBER TECHNOLOGY IN AUTOMATION, CONTROL, AND INTELLIGENT SYSTEMS (IEEE-CYBER), 2018, : 318 - 322
  • [35] Learning a Pick-and-Place Robot Task from Human Demonstration
Lin, Hsien-I
    Cheng, Chia-Hsien
    Chen, Wei-Kai
    2013 CACS INTERNATIONAL AUTOMATIC CONTROL CONFERENCE (CACS), 2013, : 312 - +
  • [36] Boosting GPT-4V's accuracy in dermoscopic classification with few-shot learning. Comment on "can ChatGPT vision diagnose melanoma? An exploratory diagnostic accuracy study"
    Wang, Jinge
    Hu, Gangqing
    JOURNAL OF THE AMERICAN ACADEMY OF DERMATOLOGY, 2024, 91 (06) : e165 - e166
  • [37] Part-Based Robot Grasp Planning from Human Demonstration
    Aleotti, Jacopo
    Caselli, Stefano
    2011 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2011,
  • [38] Trajectory Planning under Different Initial Conditions for Surgical Task Automation by Learning from Demonstration
    Osa, Takayuki
    Harada, Kanako
    Sugita, Naohiko
    Mitsuishi, Mamoru
    2014 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2014, : 6507 - 6513
  • [39] Exploring Implicit Human Responses to Robot Mistakes in a Learning from Demonstration Task
    Hayes, Cory J.
    Moosaei, Maryam
    Riek, Laurel D.
    2016 25TH IEEE INTERNATIONAL SYMPOSIUM ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION (RO-MAN), 2016, : 246 - 252
  • [40] Learning from Demonstration Facilitates Human-Robot Collaborative Task Execution
    Koskinopoulou, Maria
Piperakis, Stylianos
Trahanias, Panos
    ELEVENTH ACM/IEEE INTERNATIONAL CONFERENCE ON HUMAN ROBOT INTERACTION (HRI'16), 2016, : 59 - 66