Fine-grained cross-modal retrieval algorithm for IETM with fused attention mechanism

Citations: 0
Authors
Zhai Y. [1 ]
Gu J. [1 ]
Zong F. [1 ]
Jiang W. [1 ]
Affiliations
[1] Coastal Defense College, Naval Aviation University, Yantai, China
Keywords
attention mechanism; cross-modal; image-text retrieval; interactive electronic technical manual
DOI: 10.12305/j.issn.1001-506X.2023.12.21
Abstract
The interactive electronic technical manual (IETM) is an important technology for improving the informatization and intelligence of equipment support. To address the problem of single-modality retrieval, an improved fine-grained cross-modal retrieval algorithm that fuses an attention mechanism is proposed, taking the image-text descriptions in the data as its research object. Because the images in the data are largely sketches with little color variation, the feature extraction module uses a Vision Transformer model and a Transformer encoder to obtain the global and local features of images and text, respectively. An attention mechanism is then applied to mine fine-grained information between and within the image and text modalities, and adversarial training on the text is added to enhance the model's generalization ability. In addition, a cross-modal joint loss function is used to constrain the model. In experiments on the Pascal Sentence dataset and a self-built dataset, the mean average precision of the proposed method reaches 0.964 and 0.959 respectively, exceeding the deep supervised cross-modal retrieval (DSCMR) benchmark model by 0.248 and 0.214, respectively. © 2023 Chinese Institute of Electronics. All rights reserved.
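To make the described pipeline concrete, here is a minimal PyTorch sketch of its three ingredients: a ViT-style image encoder and a Transformer text encoder that each return global and local features, a cross-attention helper standing in for the fine-grained inter-modal attention module, and a bidirectional triplet ranking term standing in for the cross-modal joint loss. Every dimension, the module and function names (ImageEncoder, TextEncoder, cross_attend, joint_loss), and the choice of ranking loss are illustrative assumptions rather than the authors' implementation; the adversarial text-training step is omitted.

```python
# Minimal sketch with toy dimensions; not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """ViT-style encoder: patch embedding + Transformer encoder.
    Returns a global feature (CLS token) and local patch features."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, images):                     # images: (B, 3, H, W)
        p = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        p = torch.cat([self.cls.expand(len(p), -1, -1), p], dim=1) + self.pos
        h = self.encoder(p)
        return h[:, 0], h[:, 1:]                   # global CLS, local patches

class TextEncoder(nn.Module):
    """Transformer encoder over token embeddings; mean-pooled global feature."""
    def __init__(self, vocab=10000, dim=256, depth=4, heads=8, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                     # tokens: (B, L) int ids
        h = self.encoder(self.embed(tokens) + self.pos[:, :tokens.size(1)])
        return h.mean(dim=1), h                    # global mean-pool, local tokens

def cross_attend(query_local, key_local):
    """Fine-grained inter-modal attention: each region/word of one modality
    attends over the other modality's local features (scaled dot product)."""
    scale = query_local.size(-1) ** 0.5
    attn = torch.softmax(query_local @ key_local.transpose(1, 2) / scale, dim=-1)
    return attn @ key_local                        # (B, Nq, dim)

def joint_loss(img_g, txt_g, margin=0.2):
    """Stand-in for the cross-modal joint loss: bidirectional triplet ranking
    over cosine similarity with hardest in-batch negatives."""
    img_g, txt_g = F.normalize(img_g, dim=-1), F.normalize(txt_g, dim=-1)
    sim = img_g @ txt_g.t()                        # (B, B); diagonal = matched pairs
    pos = sim.diag().unsqueeze(1)
    mask = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return i2t.max(dim=1).values.mean() + t2i.max(dim=0).values.mean()

# Toy forward pass: matched image/text pairs along the batch diagonal.
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 32))
img_g, img_l = ImageEncoder()(images)
txt_g, txt_l = TextEncoder()(tokens)
attended = cross_attend(img_l, txt_l)              # image patches attend to words
loss = joint_loss(img_g, txt_g)
```

Returning both global and local features mirrors the split in the abstract: the global vectors feed the ranking loss for coarse image-text alignment, while the local patch and token features are what the attention module matches at fine granularity.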
Pages: 3915-3923
Page count: 8