CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

Cited by: 1

Authors:

Gao, Zhi [1,2]

Du, Yuntao [2]

Zhang, Xintong [2,3]

Ma, Xiaojian [2]

Han, Wenjuan [3]

Zhu, Song-Chun [1,2,4]

Li, Qing [2]

Affiliations:

[1] Peking Univ, Sch Intelligence Sci & Technol, Beijing, Peoples R China

[2] BIGAI, State Key Lab Gen Artificial Intelligence, Beijing, Peoples R China

[3] Beijing Jiaotong Univ, Beijing, Peoples R China

[4] Tsinghua Univ, Dept Automat, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01259

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge, we propose CLOVA, a Closed-LOop Visual Assistant, which operates within a framework encompassing inference, reflection, and learning phases. During the inference phase, LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly, the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants.
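
The three phases named in the abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: `Tool`, `infer`, `reflect`, `closed_loop_step`, and `gather_data` are hypothetical names, the "tools" are plain lookup tables rather than visual models, the reflection rule is a simple heuristic, and a dictionary update stands in for the prompt tuning scheme mentioned in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Tool:
    """Hypothetical stand-in for a visual tool whose knowledge can be updated."""
    name: str
    knowledge: Dict[str, str] = field(default_factory=dict)

    def run(self, value: str) -> str:
        # Return the stored answer, or "unknown" when the tool lacks the knowledge.
        return self.knowledge.get(value, "unknown")

    def update(self, examples: Dict[str, str]) -> None:
        # Placeholder for learning; the abstract describes prompt tuning instead.
        self.knowledge.update(examples)


def infer(tools: Dict[str, Tool], program: List[str], query: str) -> List[Tuple[str, str]]:
    """Inference: run the tools named in the program in order, keeping a trace."""
    trace = []
    value = query
    for name in program:
        value = tools[name].run(value)
        trace.append((name, value))
    return trace


def reflect(trace: List[Tuple[str, str]], answer_is_correct: bool) -> Optional[str]:
    """Reflection (toy heuristic): blame the first step that returned "unknown"."""
    if answer_is_correct:
        return None
    for name, output in trace:
        if output == "unknown":
            return name
    return trace[-1][0]  # Fall back to the last step if no step failed visibly.


def closed_loop_step(
    tools: Dict[str, Tool],
    program: List[str],
    query: str,
    answer_is_correct: bool,
    gather_data: Callable[[str], Dict[str, str]],
) -> str:
    """One pass of inference, reflection on human feedback, and learning."""
    trace = infer(tools, program, query)
    faulty = reflect(trace, answer_is_correct)
    if faulty is not None:
        tools[faulty].update(gather_data(faulty))
    return trace[-1][1]


# Usage with made-up data: the classifier starts without the needed knowledge.
tools = {
    "locate": Tool("locate", {"photo": "dog region"}),
    "classify": Tool("classify"),
}
program = ["locate", "classify"]
training_data = {"classify": {"dog region": "corgi"}}

first_answer = closed_loop_step(
    tools, program, "photo", answer_is_correct=False,
    gather_data=lambda name: training_data.get(name, {}),
)
second_answer = infer(tools, program, "photo")[-1][1]
print(first_answer, "->", second_answer)  # prints "unknown -> corgi"
```

The point of the sketch is the control flow: the first run fails, feedback marks it wrong, reflection picks the tool to update, and the same program succeeds on the next run without being regenerated.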

Pages: 13258-13268

Number of pages: 11
