Video captioning algorithm based on mixed training and semantic association

Authors
Chen, Shuqin [1, 2]
Zhong, Xian [1, 3]
Huang, Wenxin [4]
Lu, Yansheng [5]
Affiliations
[1] School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
[2] School of Computer Science, Hubei University of Education, Wuhan 430205, China
[3] School of Information Science and Technology, Peking University, Beijing 100091, China
[4] School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
[5] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Keywords
Associative storage - Electric transformer testing - Image coding - Long short-term memory - Semantics
DOI
10.13245/j.hust.230101
Abstract
Current mainstream methods model the dependency between words in a sequence with Transformer self-attention units or long short-term memory (LSTM) units. They ignore the semantic relationships between words in a sentence and suffer from exposure bias between the training and testing phases. To address both problems, a video captioning algorithm based on mixed training and semantic association (DC-RL) was proposed. In the encoder, a bi-directional long short-term memory recurrent neural network (LSTM1) fused the appearance features and action features obtained from the pre-trained model. In the decoder, an attention mechanism dynamically extracted the visual features corresponding to the word currently being generated, for both the global semantic decoder and the self-learning decoder, which alleviated the exposure bias caused by the discrepancy between training and testing in the traditional global semantic decoder. The global semantic decoder used the word at the previous time step of the ground-truth description to drive the generation of the current word, and a global semantic extractor additionally supplied the global semantic information corresponding to the current word. The self-learning decoder, by contrast, used the semantic information of the word it had generated at the previous time step to drive the generation of the current word. The hybrid-trained fusion network was then optimized directly with reinforcement learning, using the semantic information of the previous word, which enabled more accurate video captions. Results on the MSR-VTT dataset show that the fusion network model improves over the baseline by 2.3%, 0.3%, 1.0% and 1.9% on the four metrics B4, M, R and C (BLEU-4, METEOR, ROUGE-L and CIDEr), respectively, and that the fusion network model optimized with reinforcement learning improves by 2.0%, 0.5%, 1.9% and 6.1%, respectively.
© 2023 Huazhong University of Science and Technology. All rights reserved.
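The difference between the two decoders described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `choose_previous_word`, `decode`, `toy_model` and the parameter `teacher_forcing_prob` are hypothetical, and mixing the two input sources by a probability is an assumption borrowed from scheduled-sampling-style training, since the abstract does not state how the two decoders are combined. A lookup table stands in for the neural decoder.

```python
import random

BOS = "<bos>"  # hypothetical start-of-sentence token
UNK = "<unk>"  # hypothetical fallback for words the toy model has not seen


def choose_previous_word(reference, generated, step, teacher_forcing_prob, rng):
    """Pick the word that drives decoding at `step`.

    With probability `teacher_forcing_prob` the ground-truth word is used
    (the behaviour the abstract attributes to the global semantic decoder);
    otherwise the model's own previous output is used (the self-learning
    decoder). The probabilistic mixing itself is an assumption.
    """
    if step == 0:
        return BOS
    if rng.random() < teacher_forcing_prob:
        return reference[step - 1]
    return generated[step - 1]


def decode(reference, predict_next, teacher_forcing_prob, seed=0):
    """Generate one word per reference position using the chosen input source."""
    rng = random.Random(seed)
    generated = []
    for step in range(len(reference)):
        prev = choose_previous_word(
            reference, generated, step, teacher_forcing_prob, rng
        )
        generated.append(predict_next(prev))
    return generated


# Toy stand-in for a trained decoder: a table with one deliberate error
# ("a" -> "dog" instead of "man").
TABLE = {BOS: "a", "a": "dog", "man": "is", "is": "cooking",
         "dog": "runs", "runs": "fast"}


def toy_model(prev_word):
    return TABLE.get(prev_word, UNK)


reference = ["a", "man", "is", "cooking"]
teacher_forced = decode(reference, toy_model, teacher_forcing_prob=1.0)
self_fed = decode(reference, toy_model, teacher_forcing_prob=0.0)
print(teacher_forced)  # ['a', 'dog', 'is', 'cooking']
print(self_fed)        # ['a', 'dog', 'runs', 'fast']
```

With ground-truth inputs, the single mistake stays local because the next step is driven by the reference word. When the model feeds on its own output, the same mistake propagates through the rest of the caption. This is the exposure bias that the self-learning decoder is meant to expose the model to during training.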
Pages: 67 - 74