Adaptive semantic guidance network for video captioning

Cited by: 0
Authors
Liu, Yuanyuan [1]
Zhu, Hong [1]
Wu, Zhong [2]
Du, Sen [3]
Wu, Shuning [1]
Shi, Jing [1]
Affiliations
[1] Xian Univ Technol, Sch Automat & Informat Engn, Xian 710048, Shaanxi, Peoples R China
[2] Yuncheng Univ, Shanxi Prov Intelligent Optoelect Sensing Applicat, Yuncheng 044000, Shanxi, Peoples R China
[3] Air Force Engn Univ, Informat & Nav Coll, Xian 710051, Shaanxi, Peoples R China
Keywords
Video captioning; Adaptive semantic guidance network; Semantic enhancement encoder; Adaptive control decoder
DOI
10.1016/j.cviu.2024.104255
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Video captioning aims to describe video content in natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on language-prior information about the text acquired during training, so the model tends to output high-frequency fixed phrases. To address this problem, we extract high-quality semantic information from the multi-modal input and then build a semantic guidance mechanism that adapts the contributions of visual and textual semantics to caption generation. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual modalities. The ACD dynamically adjusts the contribution weights of the visual and textual semantics for each generated word, guiding the model to adaptively focus on the correct semantic information. Together, these two modules help the model overcome its over-reliance on language priors, yielding more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets: our method reaches the state of the art on MSVD and MSR-VTT and also performs well on YouCookII. These experiments verify the advantages of our method.
Pages: 13