CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Cited by: 186
Authors
Peng, Yuxin [1 ]
Qi, Jinwei [1 ]
Huang, Xin [1 ]
Yuan, Yuxin [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; fine-grained correlation; joint optimization; multi-task learning; REPRESENTATION; MODEL;
DOI
10.1109/TMM.2017.2742704
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNNs): the first learning stage generates a separate representation for each modality, and the second learning stage learns the cross-modal common representation. However, the existing methods have three limitations: 1) In the first learning stage, they only model intramodality correlation but ignore intermodality correlation with its rich complementary context. 2) In the second learning stage, they only adopt shallow networks with single-loss regularization and ignore the intrinsic relevance of intramodality and intermodality correlation. 3) Only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by a hierarchical network, and the contributions are as follows: 1) In the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage, a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and the intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that our CCL approach achieves the best performance.
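To make the two-stage, multitask design described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' released code, of a cross-modal network whose loss balances an intramodality semantic-category term against an intermodality pairwise-similarity term. It assumes PyTorch; all dimensions, module names, and the trade-off weight alpha are assumptions for illustration, and the fine-grained patch branch of CCL is only indicated in comments.

# Illustrative sketch (assumed PyTorch): two-stage cross-modal network with a
# multi-task loss combining intramodality classification and intermodality
# pairwise similarity, loosely following the description in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStageCrossModalNet(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, hidden_dim=1024,
                 common_dim=256, num_classes=10):
        super().__init__()
        # Stage 1: modality-specific representation learning.
        self.img_encoder = nn.Sequential(
            nn.Linear(img_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.txt_encoder = nn.Sequential(
            nn.Linear(txt_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # Stage 2: projection into a shared common space.
        self.img_common = nn.Linear(hidden_dim, common_dim)
        self.txt_common = nn.Linear(hidden_dim, common_dim)
        # Intramodality semantic-category head (shared classifier).
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        img_c = F.normalize(self.img_common(self.img_encoder(img_feat)), dim=1)
        txt_c = F.normalize(self.txt_common(self.txt_encoder(txt_feat)), dim=1)
        return img_c, txt_c


def multitask_loss(img_c, txt_c, labels, classifier, alpha=0.5):
    """Balance intramodality category constraints and intermodality
    pairwise similarity constraints (alpha is an assumed trade-off weight)."""
    # Intramodality: semantic-category (cross-entropy) loss per modality.
    cls_loss = (F.cross_entropy(classifier(img_c), labels) +
                F.cross_entropy(classifier(txt_c), labels))
    # Intermodality: paired image/text embeddings should be close.
    pair_loss = (1.0 - F.cosine_similarity(img_c, txt_c)).mean()
    return alpha * cls_loss + (1.0 - alpha) * pair_loss


if __name__ == "__main__":
    # Toy usage with random coarse-grained instance features; in CCL,
    # fine-grained patch features would additionally be fused with these.
    model = TwoStageCrossModalNet()
    img = torch.randn(8, 4096)   # e.g., pooled CNN features
    txt = torch.randn(8, 300)    # e.g., averaged word embeddings
    labels = torch.randint(0, 10, (8,))
    img_c, txt_c = model(img, txt)
    loss = multitask_loss(img_c, txt_c, labels, model.classifier)
    loss.backward()
    print(float(loss))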
Pages: 405-420
Number of pages: 16
Related papers
50 records in total
  • [1] Cross-Modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation
    Hua, Yan
    Wang, Shuhui
    Liu, Siyuan
    Cai, Anni
    Huang, Qingming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (06) : 1201 - 1216
  • [2] A Cross-Modal Correlation Fusion Network for Emotion Recognition in Conversations
    Tang, Xiaolyu
    Cai, Guoyong
    Chen, Ming
    Yuan, Peicong
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024, 2025, 15363 : 55 - 68
  • [3] Cross-modal complementary network with hierarchical fusion for multimodal sentiment classification
    Peng, Cheng
    Zhang, Chunxia
    Xue, Xiaojun
    Gao, Jiameng
    Liang, Hongjian
    Niu, Zhengdong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2022, 27 (04) : 664 - 679
  • [4] Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification
    Cheng Peng
    Chunxia Zhang
    Xiaojun Xue
    Jiameng Gao
    Hongjian Liang
    Zhengdong Niu
    Tsinghua Science and Technology, 2022, 27 (04) : 664 - 679
  • [5] TINA: Cross-modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation
    Hua, Yan
    Wang, Shuhui
    Liu, Siyuan
    Huang, Qingming
    Cai, Anni
    2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2014, : 190 - 199
  • [6] Cross-modal semantic correlation learning by Bi-CNN network
    Wang, Chaoyi
    Li, Liang
    Yan, Chenggang
    Wang, Zhan
    Sun, Yaoqi
    Zhang, Jiyong
    IET IMAGE PROCESSING, 2021, 15 (14) : 3674 - 3684
  • [7] Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network
    Huang, Ju
    Lu, Pengtao
    Sun, Shuifa
    Wang, Fangyi
    ELECTRONICS, 2023, 12 (16)
  • [8] Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition
    Sunder Ali Khowaja
    Seok-Lyong Lee
    Neural Computing and Applications, 2020, 32 : 10423 - 10434
  • [9] Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition
    Khowaja, Sunder Ali
    Lee, Seok-Lyong
    NEURAL COMPUTING & APPLICATIONS, 2020, 32 (14) : 10423 - 10434
  • [10] IMPROVING CROSS-MODAL CORRELATION LEARNING WITH HYPERLINKS
    Wang, Shuhui
    Wu, Yiling
    Huang, Qingming
    2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2015