Few-shot Incremental Learning with Textual-knowledge Embedding by Visual-language Model

Cited by: 0

Authors:

Yao H.-T. [1]

Yu L. [3]

Xu C.-S. [1,2]

Affiliations:

[1] State Key Laboratory of Multimodal Artificial Intelligence Systems (Institute of Automation, Chinese Academy of Sciences), Beijing

[2] School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing

[3] School of Computer Science and Engineering, Tianjin University of Technology, Tianjin

Source:

Ruan Jian Xue Bao/Journal of Software | 2024, Vol. 35, No. 05

Keywords:

class-space guided anti-forgetting learning; few-shot incremental learning (FSIL); textual-knowledge embedding; visual-language model

DOI:

10.13328/j.cnki.jos.007022

CLC number:

Subject classification number:

Abstract:

Real-world applications often face data scarcity and dynamically changing data. Few-shot incremental learning (FSIL) aims to learn new knowledge from a small amount of data while reducing the model’s catastrophic forgetting of old knowledge. Existing few-shot incremental learning algorithms (CEC, FACT, etc.) mainly use visual features to adjust the feature encoder or classifier, so that the model transfers to new data without forgetting old data. However, the visual features of a few samples can rarely model the complete feature distribution of a class, which weakens the generalization ability of these algorithms. Compared with visual features, the text features of image class descriptions generalize better and resist forgetting better. Therefore, based on the visual-language model (VLM), this study investigates few-shot incremental learning with textual-knowledge embedding: by embedding text features with anti-forgetting ability into visual features, it learns both new-class and old-class data effectively. Specifically, in the base learning stage, the study uses the VLM to extract the pre-trained visual features and class text descriptions of the image. It then uses the text encoder to project the pre-trained visual features into the text space. Next, it uses the visual encoder to fuse the learned text features with the pre-trained visual features, abstracting visual features with high discriminative ability. In the incremental learning stage, the study proposes class-space guided anti-forgetting learning, which uses the class-space encoding of old and new data features to fine-tune the visual encoder and text encoder, so that new knowledge is learned while old knowledge is reviewed. The study verifies the effectiveness of the algorithm on four datasets (CIFAR-100, CUB-200, Car-196, and miniImageNet), showing that textual-knowledge embedding based on the VLM further improves the robustness of few-shot incremental learning beyond what visual features alone provide. © 2024 Chinese Academy of Sciences. All rights reserved.
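The pipeline the abstract describes (project visual features into the text space, fuse them with the original visual features, then recognize classes) can be illustrated with a toy sketch. This is not the authors' implementation: the linear projection, the weighted-average fusion, the cosine-similarity classifier over class text embeddings, and every name below (`project_to_text_space`, `fuse`, `classify`, `alpha`) are assumptions chosen for illustration, and the class-space guided anti-forgetting fine-tuning is not modeled at all.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def project_to_text_space(visual, weight):
    """Hypothetical stand-in for the text encoder: a plain linear map."""
    return [sum(w * x for w, x in zip(row, visual)) for row in weight]


def fuse(visual, text_like, alpha=0.5):
    """Hypothetical stand-in for the visual encoder: weighted average, then normalize."""
    mixed = [alpha * t + (1.0 - alpha) * v for v, t in zip(visual, text_like)]
    return l2_normalize(mixed)


def classify(feature, class_text_embeddings):
    """Assumed classifier: pick the class whose text embedding is most cosine-similar."""
    feature = l2_normalize(feature)
    scores = {
        name: sum(a * b for a, b in zip(feature, l2_normalize(embedding)))
        for name, embedding in class_text_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy base session: two classes with made-up 3-dimensional text embeddings.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
class_text = {"bird": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}

visual = [0.9, 0.1, 0.0]
fused = fuse(visual, project_to_text_space(visual, identity))
base_prediction = classify(fused, class_text)

# Toy incremental session: a new class only adds one more text embedding.
class_text["flower"] = [0.0, 0.0, 1.0]
new_visual = [0.0, 0.1, 0.95]
new_fused = fuse(new_visual, project_to_text_space(new_visual, identity))
new_prediction = classify(new_fused, class_text)
old_prediction_after_update = classify(fused, class_text)
```

In this toy setup, adding a class leaves the existing class embeddings untouched, which gives some intuition for why text-side knowledge may resist forgetting; the method in the paper additionally fine-tunes both encoders, which this sketch does not attempt.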

Citation

Pages: 2101-2119

Page count: 18
