Adaptive semantic guidance network for video captioning

Cited by: 0
Authors
Liu, Yuanyuan [1]
Zhu, Hong [1]
Wu, Zhong [2]
Du, Sen [3]
Wu, Shuning [1]
Shi, Jing [1]
Affiliations
[1] Xian Univ Technol, Sch Automat & Informat Engn, Xian 710048, Shaanxi, Peoples R China
[2] Yuncheng Univ, Shanxi Prov Intelligent Optoelect Sensing Applicat, Yuncheng 044000, Shanxi, Peoples R China
[3] Air Force Engn Univ, Informat & Nav Coll, Xian 710051, Shaanxi, Peoples R China
Keywords
Video captioning; Adaptive semantic guidance network; Semantic enhancement encoder; Adaptive control decoder
DOI
10.1016/j.cviu.2024.104255
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video captioning aims to describe video content in natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on the language priors learned from text during training, so the model tends to output high-frequency fixed phrases. To solve this problem, we extract high-quality semantic information from the multi-modal input and then build a semantic guidance mechanism that adapts the contributions of visual semantics and textual semantics to caption generation. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual inputs. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. The two modules work together to help the model overcome its over-reliance on language priors, resulting in more accurate video captions. Finally, we conduct extensive experiments on commonly used video captioning datasets. Our method reaches state-of-the-art results on MSVD and MSR-VTT and also achieves good performance on YouCookII. These experiments verify the advantages of our method.
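The abstract gives no formulas for the ACD, so the following is only a minimal sketch of one common way to weight two semantic sources dynamically: a scalar sigmoid gate that blends a visual vector and a textual vector. The function name `adaptive_fuse`, the gate parameters `w_visual`, `w_textual`, and `bias`, and the scalar-gate design are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(visual, textual, w_visual, w_textual, bias=0.0):
    """Blend two equal-length semantic vectors with a scalar gate.

    Hypothetical illustration: the gate is computed from both inputs,
    so the mixing weight changes with the content being described.
    """
    if not (len(visual) == len(textual) == len(w_visual) == len(w_textual)):
        raise ValueError("all vectors must have the same length")

    # Gate score: a linear function of both semantic vectors.
    score = bias
    score += sum(w * v for w, v in zip(w_visual, visual))
    score += sum(w * t for w, t in zip(w_textual, textual))
    gate = sigmoid(score)

    # gate weights the visual semantics, (1 - gate) the textual semantics.
    fused = [gate * v + (1.0 - gate) * t for v, t in zip(visual, textual)]
    return gate, fused


# With zero gate parameters the gate is exactly 0.5 (an even blend).
gate, fused = adaptive_fuse(
    visual=[1.0, 0.0],
    textual=[0.0, 1.0],
    w_visual=[0.0, 0.0],
    w_textual=[0.0, 0.0],
)
```

In a trained model the gate parameters would be learned, and a gate near 1 would correspond to relying mostly on visual semantics rather than on textual (language-prior) semantics for the current word.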
Pages: 13
Related papers
50 in total
  • [31] Hierarchical Modular Network for Video Captioning
    Ye, Hanhua
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Huang, Qingming
    Yang, Ming-Hsuan
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17918 - 17927
  • [32] Rethinking Network for Classroom Video Captioning
    Zhu, Mingjian
    Duan, Chenrui
    Yu, Changbin
    TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [33] Video Captioning Method Based on Semantic Topic Association
    Fu, Yan
    Yang, Ying
    Ye, Ou
    ELECTRONICS, 2025, 14 (05):
  • [34] Video Captioning with Semantic Information from the Knowledge Base
    Wang, Dan
    Song, Dandan
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (IEEE ICBK 2017), 2017, : 224 - 229
  • [35] Structured Encoding Based on Semantic Disambiguation for Video Captioning
    Sun, Bo
    Tian, Jinyu
    Wu, Yong
    Yu, Lunjun
    Tang, Yuanyan
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1032 - 1048
  • [36] Semantic guidance incremental network for efficiency video super-resolution
    He, Xiaonan
    Xia, Yukun
    Qiao, Yuansong
    Lee, Brian
    Ye, Yuhang
    VISUAL COMPUTER, 2024, : 4899 - 4911
  • [37] Dense video captioning using unsupervised semantic information
    Estevam, Valter
    Laroca, Rayson
    Pedrini, Helio
    Menotti, David
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107
  • [38] Video captioning with stacked attention and semantic hard pull
    Rahman, Md Mushfiqur
    Abedin, Thasin
    Prottoy, Khondokar S. S.
    Moshruba, Ayana
    Siddiqui, Fazlul Hasan
    PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
  • [39] Richer Semantic Visual and Language Representation for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Wang, Hanzhang
    Xu, Kaisheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1871 - 1876
  • [40] Semantic Tag Augmented XlanV Model for Video Captioning
    Huang, Yiqing
    Xue, Hongwei
    Chen, Jiansheng
    Ma, Huimin
    Ma, Hongbing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4818 - 4822