Towards artificial general intelligence via a multimodal foundation model

Cited: 161
Authors
Fei, Nanyi [1 ,2 ,3 ]
Lu, Zhiwu [1 ,2 ]
Gao, Yizhao [1 ,2 ]
Yang, Guoxing [1 ,2 ]
Huo, Yuqi [2 ,3 ]
Wen, Jingyuan [1 ,2 ]
Lu, Haoyu [1 ,2 ]
Song, Ruihua [1 ,2 ]
Gao, Xin [4 ]
Xiang, Tao [5 ]
Sun, Hao [1 ,2 ]
Wen, Ji-Rong [1 ,2 ,3 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
[2] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[3] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[4] King Abdullah Univ Sci & Technol, Comp Elect & Math Sci & Engn Div, Thuwal, Saudi Arabia
[5] Univ Surrey, Dept Elect & Elect Engn, Guildford, Surrey, England
Funding
National Natural Science Foundation of China;
Keywords
SINGLE NEURONS;
DOI
10.1038/s41467-022-30761-2
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07; 0710; 09;
Abstract
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only a single cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted to various downstream cognitive tasks. To achieve this goal, we propose to pre-train our foundation model by self-supervised learning on weakly semantically correlated data crawled from the Internet, and show that promising results can be obtained on a wide range of downstream tasks. In particular, with the developed model-interpretability tools, we demonstrate that our foundation model now possesses strong imagination ability. We believe that our work makes a transformative stride towards AGI, from our common practice of "weak or narrow AI" to that of "strong or generalized AI".
Artificial intelligence approaches inspired by human cognitive function usually have a single learned ability. The authors propose a multimodal foundation model that demonstrates cross-domain learning and adaptation for a broad range of downstream cognitive tasks.
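The abstract describes self-supervised pre-training on weakly semantically correlated image-text pairs crawled from the web. Below is a minimal, illustrative sketch of a generic two-tower contrastive (InfoNCE) objective of the kind commonly used for such data; the function name, embedding dimensions, and temperature are assumptions for illustration only and do not reproduce the paper's exact architecture or loss.

    # Illustrative sketch only: symmetric InfoNCE loss for two-tower
    # image-text contrastive pre-training on weakly correlated pairs.
    # Names, dimensions, and temperature are hypothetical.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalize both embedding sets so the dot product is cosine similarity.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # (batch, batch) similarity matrix; the diagonal holds matched pairs.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric loss: image-to-text and text-to-image directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy usage with random stand-in features from hypothetical encoders.
    image_features = torch.randn(8, 256)
    text_features = torch.randn(8, 256)
    print(contrastive_loss(image_features, text_features).item())

In this setup, matched image-text pairs within a batch act as positives and all other pairings as negatives, which is what lets weakly (rather than strongly) correlated web data provide a useful training signal.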
Pages: 13
Related References
60 references in total
[1]  
Radford A, Narasimhan K, 2018, Improving Language Understanding by Generative Pre-Training
[2]  
Anderson P, 2018, PROC CVPR IEEE, P6077, DOI 10.1109/CVPR.2018.00636
[3]  
[Anonymous], 2010, INT C MACH LEARN
[4]  
[Anonymous], 2021, EDITORS TECHNOLOGY R
[5]  
[Anonymous], 2014, 27TH INT C NEURAL INF
[6]  
Bommasani R., 2021, arXiv
[7]  
Brown TB, 2020, ADV NEUR IN, V33
[8]   IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [J].
Chen, Hui ;
Ding, Guiguang ;
Liu, Xudong ;
Lin, Zijia ;
Liu, Ji ;
Han, Jungong .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :12652-12660
[9]  
Chen T, 2020, PR MACH LEARN RES, V119
[10]   Exploring Simple Siamese Representation Learning [J].
Chen, Xinlei ;
He, Kaiming .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :15745-15753