Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

Times cited: 1
Authors
Bayisa, Leta Yobsan [1]
Wang, Weidong [2]
Wang, Qingxian [2]
Ukwuoma, Chiagoziem C. [3, 4, 5]
Gutema, Hirpesa Kebede [1]
Endris, Ahmed [6]
Abu, Turi [7]
Affiliations
[1] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen 518055, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu, Sichuan, Peoples R China
[3] Chengdu Univ Technol, Oxford Brooks Univ, Sino British Collaborat Educ, Chengdu 610059, Sichuan, Peoples R China
[4] Chengdu Univ Technol, Coll Nucl Technol & Automation Engn, Chengdu 610059, Sichuan, Peoples R China
[5] Chengdu Univ Technol, Sichuan Engn Technol Res Ctr Ind Internet Intellig, Chengdu 610059, Sichuan, Peoples R China
[6] Chinese Acad Sci, Shenzhen Inst Adv Technol, Paul C Lauterbur Res Ctr Biomed Imaging, Shenzhen, Peoples R China
[7] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Keywords
Multitask representation learning; Transfer learning; Encoder-decoder; Attention mechanism
DOI
10.1007/s13042-024-02177-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deep learning has achieved impressive performance in computer vision tasks such as object detection, image classification, and image captioning. Although most models excel at a single vision or language task, designing one architecture that balances task specialization, performance, and adaptability across diverse tasks remains challenging. Integrating vision and language requires combining text embeddings with visual representations so that the dependencies between the subareas of each task can be captured. This paper proposes a single architecture that handles several computer vision tasks and can be fine-tuned for other specific vision and language tasks. The model uses a modified DenseNet201 as the feature extractor (network backbone), an encoder-decoder architecture, and task-specific heads for inference. Enhanced data augmentation and normalization techniques are used to reduce overfitting and improve precision. The model's robustness is evaluated on more than five datasets spanning image classification, object detection, image captioning, and adversarial attack and defense. Experimental results show competitive performance relative to prior work on CIFAR-10, CIFAR-100, Flickr8, Flickr30, Caltech10, and task-specific datasets such as OCT and BreakHis. The model is flexible and adapts easily to new tasks: it can be extended to other vision and language tasks by fine-tuning with task-specific input indices.
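The routing idea described in the abstract can be sketched in plain Python. In this design, a shared backbone and encoder feed task-specific heads, and a task index selects which head runs. This is a hypothetical toy, not the authors' implementation. A single dense layer stands in for the modified DenseNet201 backbone and encoder-decoder. The names `MultiTaskModel`, `add_head`, `CLASSIFY`, and `DETECT` are invented for illustration. "Task-specific input indices" is assumed to mean an integer that selects a head. The captioning head, which would need a sequence decoder, is omitted.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map y = W x + b, with W stored row-major as (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(z: Vector) -> Vector:
    return [max(0.0, v) for v in z]


def softmax(z: Vector) -> Vector:
    # Subtract the max for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def init_layer(out_dim: int, in_dim: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    # Small random weights and zero bias for one dense layer.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class MultiTaskModel:
    """Hypothetical shared encoder plus one head per task, selected by an integer task index."""

    def __init__(self, in_dim: int, feat_dim: int, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.feat_dim = feat_dim
        # Stand-in for the shared backbone/encoder (assumption: the paper uses
        # a modified DenseNet201; here it is a single ReLU dense layer).
        self.enc_w, self.enc_b = init_layer(feat_dim, in_dim, self.rng)
        self.heads: Dict[int, Callable[[Vector], Vector]] = {}

    def encode(self, x: Vector) -> Vector:
        return relu(linear(self.enc_w, self.enc_b, x))

    def add_head(self, task_index: int, out_dim: int,
                 activation: Callable[[Vector], Vector]) -> None:
        # Registering a head leaves the shared encoder untouched, so a new task
        # can be added without rebuilding the rest of the model.
        w, b = init_layer(out_dim, self.feat_dim, self.rng)
        self.heads[task_index] = lambda f: activation(linear(w, b, f))

    def forward(self, x: Vector, task_index: int) -> Vector:
        # The task index routes the shared features to the matching head.
        if task_index not in self.heads:
            raise KeyError(f"no head registered for task index {task_index}")
        return self.heads[task_index](self.encode(x))


# Hypothetical task indices for illustration.
CLASSIFY, DETECT = 0, 1

model = MultiTaskModel(in_dim=8, feat_dim=16)
model.add_head(CLASSIFY, out_dim=10, activation=softmax)  # class probabilities
model.add_head(DETECT, out_dim=4, activation=sigmoid)     # normalized box (x, y, w, h)

image_features = [0.1 * i for i in range(8)]  # toy input vector, not real image data
probs = model.forward(image_features, CLASSIFY)
box = model.forward(image_features, DETECT)
```

A dictionary of heads keyed by task index is one simple way to make such a model extensible. Adding a task means registering one more head while the shared encoder is reused. In practice the shared weights are often frozen, or trained with a lower learning rate, during fine-tuning. This sketch performs no training at all.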
Pages: 4617-4637
Page count: 21