Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning

Cited by: 0
Authors
Zhang, Bolin [1 ]
Kyutoku, Haruya [2 ]
Doman, Keisuke [3 ]
Komamizu, Takahiro [4 ]
Ide, Ichiro [5 ]
Qian, Jiangbo [1 ]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Zhejiang, Peoples R China
[2] Aichi Univ Technol, Fac Engn, Gamagori, Aichi, Japan
[3] Chukyo Univ, Sch Engn, Toyota, Aichi, Japan
[4] Nagoya Univ, Math & Data Sci Ctr, Nagoya, Aichi, Japan
[5] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi, Japan
Keywords
Cross-modal recipe retrieval; Unified text encoder; Contrastive learning;
DOI
10.1016/j.knosys.2024.112641
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal recipe retrieval is vital for transforming visual food cues into actionable cooking guidance, making culinary creativity more accessible. Existing methods separately encode the recipe Title, Ingredient, and Instruction using different text encoders, then aggregate them to obtain a recipe feature, and finally match it with the encoded image feature in a joint embedding space. These methods perform well but incur significant computational cost. In addition, they only match the entire recipe against the image and ignore the fine-grained correspondence between recipe components and the image, resulting in insufficient cross-modal interaction. To this end, we propose the Unified Text Encoder with Fine-grained Contrastive Learning (UTE-FCL) to achieve a simple but efficient model. Specifically, for each recipe, UTE-FCL first concatenates the Ingredient and Instruction texts, each composed of multiple sentences, into a single text apiece. Then, it joins these two concatenated texts with the original single-phrase Title to obtain the concatenated recipe. Finally, it encodes these three concatenated texts and the original Title with a Transformer-based Unified Text Encoder (UTE). This proposed structure greatly reduces memory usage and improves feature-encoding efficiency. Further, we propose fine-grained contrastive learning objectives that capture the correspondence between recipe components and the image at the Title, Ingredient, and Instruction levels by measuring mutual information. Extensive experiments demonstrate the effectiveness of UTE-FCL compared to existing methods.
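The abstract's fine-grained objectives, which match each recipe component (Title, Ingredient, Instruction) against the image by measuring mutual information, can be sketched with an InfoNCE-style contrastive loss, a standard lower bound on mutual information. This is a minimal illustration, not the paper's actual implementation; the function names and the equal weighting of the three component losses are assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(component_feats, image_feats, temperature=0.07):
    """InfoNCE loss between one recipe component and the images.

    component_feats, image_feats: (batch, dim) L2-normalized embeddings.
    Pairs sharing a batch index are positives; all others are negatives.
    """
    # Cosine-similarity logits scaled by temperature, shape (B, B)
    logits = component_feats @ image_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: component-to-image and image-to-component retrieval
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def fine_grained_loss(title_f, ingredient_f, instruction_f, image_f):
    """Sum the component-level contrastive objectives (equal weights assumed)."""
    return (info_nce(title_f, image_f)
            + info_nce(ingredient_f, image_f)
            + info_nce(instruction_f, image_f))
```

Each call to `info_nce` tightens the alignment between one component's embedding and the matching image, which is what gives the model component-level, rather than whole-recipe, cross-modal interaction.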
Pages: 15
Related Papers
50 items
  • [1] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval
    Liu, Hui
    Lv, Gang
    Gu, Yanhong
    Nian, Fudong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 : 298 - 310
  • [2] Fine-grained Feature Assisted Cross-modal Image-text Retrieval
    Bu, Chaofei
    Liu, Xueliang
    Huang, Zhen
    Su, Yuling
    Tu, Junfeng
    Hong, Richang
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 306 - 320
  • [3] Fine-grained Cross-modal Alignment Network for Text-Video Retrieval
    Han, Ning
    Chen, Jingjing
    Xiao, Guangyi
    Zhang, Hao
    Zeng, Yawen
    Chen, Hao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3826 - 3834
  • [4] Cross-modal subspace learning for fine-grained sketch-based image retrieval
    Xu, Peng
    Yin, Qiyue
    Huang, Yongye
    Song, Yi-Zhe
    Ma, Zhanyu
    Wang, Liang
    Xiang, Tao
    Kleijn, W. Bastiaan
    Guo, Jun
    NEUROCOMPUTING, 2018, 278 : 75 - 86
  • [5] Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven
    Miao, Chunyan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5517 - 5526
  • [6] Cross-modal knowledge learning with scene text for fine-grained image classification
    Xiong, Li
    Mao, Yingchi
    Wang, Zicheng
    Nie, Bingbing
    Li, Chang
    IET IMAGE PROCESSING, 2024, 18 (06) : 1447 - 1459
  • [7] Contrastive Label Correlation Enhanced Unified Hashing Encoder for Cross-modal Retrieval
    Wu, Hongfa
    Zhang, Lisai
    Chen, Qingcai
    Deng, Yimeng
    Siebert, Joanna
    Han, Yunpeng
    Li, Zhonghua
    Kong, Dejiang
    Cao, Zhao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 2158 - 2168
  • [8] TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval
    Li, Qiqi
    Ma, Longfei
    Jiang, Zheng
    Li, Mingyong
    Jin, Bo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02): : 3713 - 3728
  • [9] Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval
    Cheng, Qingrong
    Gu, Xiaodong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 : 31401 - 31428
  • [10] Fine-Grained Label Learning via Siamese Network for Cross-modal Information Retrieval
    Xu, Yiming
    Yu, Jing
    Guo, Jingjing
    Hu, Yue
    Tan, Jianlong
    COMPUTATIONAL SCIENCE - ICCS 2019, PT II, 2019, 11537 : 304 - 317