Semantic similarity information discrimination for video captioning

Cited by: 3
Authors
Du, Sen [1 ]
Zhu, Hong [1 ]
Xiong, Ge [1 ]
Lin, Guangfeng [2 ]
Wang, Dong [1 ]
Shi, Jing [1 ]
Wang, Jing [2 ]
Xing, Nan [1 ]
Affiliations
[1] Xian Univ Technol, Sch Automation & Informat Engn, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
[2] Xian Univ Technol, Informat Sci Dept, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
Keywords
Video captioning; Semantic detection; Bilinear pooling; Channel attention; Natural language processing
DOI
10.1016/j.eswa.2022.118985
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video captioning aims to automatically describe the objects in a video, and their actions, using natural language sentences. A correct understanding of both visual and linguistic information is critical for this task. Many existing methods fuse different features to generate sentences, but the resulting sentences often contain improper nouns and verbs. Inspired by the success of fine-grained visual recognition, we attribute these improper words to the difficulty of discriminating semantically similar information. In this paper, we design a semantic bilinear block (SBB) that widens the gap between the probabilities of existing and nonexistent words by capturing fine-grained features that discriminate semantic information. We also design a linear attention block (LAB) that implements channel-wise attention for 1-D features by simplifying the squeeze-and-excitation structure. Building on these, we propose a semantic discrimination network (SDN) that integrates the LAB and SBB into the video encoder and decoder, leveraging channel-wise attention to discriminate semantically similar information for better video captioning. Experiments on two widely used datasets, MSVD and MSR-VTT, demonstrate that the proposed SDN outperforms state-of-the-art methods.
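The abstract gives enough detail to sketch the LAB: channel-wise attention on a 1-D feature, obtained by simplifying squeeze-and-excitation (since a 1-D feature has no spatial dimensions, the global-pooling "squeeze" step drops away and only the excitation MLP and channel re-weighting remain). Below is a minimal PyTorch sketch under that reading; the module name LinearAttentionBlock, the reduction ratio, and the two-layer excitation MLP are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class LinearAttentionBlock(nn.Module):
    """Channel-wise attention for a 1-D feature vector.

    A simplification of squeeze-and-excitation: with no spatial
    dimensions to pool over, only the excitation MLP and the
    per-channel gating are kept. Layer sizes are assumptions.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore width
            nn.Sigmoid(),                                # gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels); each channel is re-scaled by a learned gate
        return x * self.excite(x)


# Usage: re-weight a fused 1-D video feature before semantic detection.
feat = torch.randn(8, 512)       # batch of pooled video features (assumed size)
lab = LinearAttentionBlock(512)
weighted = lab(feat)             # same shape, channel-wise re-scaled
```

The design choice mirrors the abstract's stated goal: emphasizing informative channels in the fused feature before the decoder helps separate semantically similar word candidates.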
Pages: 12
Related Papers
50 records in total (items [41]-[50] shown)
  • [41] RECENCY DISCRIMINATION AS A FUNCTION OF ACOUSTIC AND SEMANTIC SIMILARITY
    HACKER, MJ
    BULLETIN OF THE PSYCHONOMIC SOCIETY, 1980, 16 (03) : 149 - 149
  • [42] Dense Video Captioning With Early Linguistic Information Fusion
    Aafaq, Nayyer
    Mian, Ajmal
    Akhtar, Naveed
    Liu, Wei
    Shah, Mubarak
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2309 - 2322
  • [43] Context Gating with Short Temporal Information for Video Captioning
    Xu, Jinlei
    Xu, Ting
    Tian, Xin
    Liu, Chunping
    Ji, Yi
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [44] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    PROCEEDINGS OF THE 2ND WORKSHOP ON MULTIMEDIA FOR ACCESSIBLE HUMAN COMPUTER INTERFACES (MAHCI '19), 2019, : 25 - 32
  • [46] Center-enhanced video captioning model with multimodal semantic alignment
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    NEURAL NETWORKS, 2024, 180
  • [47] BiTransformer: augmenting semantic context in video captioning via bidirectional decoder
    Zhong, Maosheng
    Zhang, Hao
    Wang, Yong
    Xiong, Hao
    MACHINE VISION AND APPLICATIONS, 2022, 33 (05)
  • [48] Multi-level video captioning method based on semantic space
    Yao, Xiao
    Zeng, Yuanlin
    Gu, Min
    Yuan, Ruxi
    Li, Jie
    Ge, Junyi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (28) : 72113 - 72130
  • [49] Global-Local Combined Semantic Generation Network for Video Captioning
    Mao L.
    Gao H.
    Yang D.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09): : 1374 - 1382
  • [50] Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
    Shi, Botian
    Ji, Lei
    Niu, Zhendong
    Duan, Nan
    Zhou, Ming
    Chen, Xilin
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4337 - 4345