Multimodal Context Fusion Based Dense Video Captioning Algorithm

Cited: 0
Authors
Li, Meiqi [1 ]
Zhou, Ziwei [1 ]
Affiliations
[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China
Keywords
Dense video description; Transformer; Multimodal feature fusion; Event context; SCN decoder
DOI
Not available
CLC Number
T [Industrial Technology]
Subject Classification Code
08
Abstract
The core task of dense video description is to identify all events occurring in an untrimmed video and to generate a textual description for each of them. This has applications in areas such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video description models often overlook textual information appearing in the video (e.g., road signs, subtitles), as well as the contextual relationships between events, both of which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video description approach based on event-context fusion. The model uses a C3D network to extract visual features and integrates OCR to extract textual information, thereby enriching the semantic understanding of the video content. During feature extraction, sliding-window and temporal-alignment techniques are applied to ensure the temporal consistency of the visual, audio, and textual features. A multimodal context fusion encoder captures the temporal and semantic relationships between events and deeply integrates the multimodal features, and an SCN decoder then generates each description word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets and compared with several popular models. Experimental results show significant improvements in CIDEr scores, reaching 98.8 and 53.7 on the two datasets, respectively. In addition, ablation studies are conducted to assess the effectiveness and stability of each component of the model.
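The abstract outlines the pipeline (C3D visual features, OCR-extracted text, sliding-window temporal alignment, a multimodal context fusion encoder, and an SCN decoder) without implementation detail. The PyTorch sketch below illustrates one plausible reading of the alignment-and-fusion step only; the adaptive-pooling alignment, the feature dimensions, and all module and parameter names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: sliding-window temporal alignment + multimodal fusion encoder.
# All dimensions and the pooling-based alignment are illustrative assumptions;
# the record above does not specify these details.
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_to_windows(feats: torch.Tensor, num_windows: int) -> torch.Tensor:
    """Average-pool a (T, D) feature sequence into a fixed number of temporal
    windows so that modalities sampled at different rates (visual, audio,
    OCR text) share a common time axis."""
    x = feats.t().unsqueeze(0)                 # (1, D, T): pool1d layout
    x = F.adaptive_avg_pool1d(x, num_windows)  # (1, D, W)
    return x.squeeze(0).t()                    # (W, D)

class FusionEncoder(nn.Module):
    """Project each modality to a shared width, merge per window, and run a
    Transformer encoder over the aligned sequence to model event context."""
    def __init__(self, d_vis, d_aud, d_txt, d_model=512, n_layers=2):
        super().__init__()
        self.proj_v = nn.Linear(d_vis, d_model)
        self.proj_a = nn.Linear(d_aud, d_model)
        self.proj_t = nn.Linear(d_txt, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, vis, aud, txt, num_windows=32):
        v = align_to_windows(vis, num_windows)
        a = align_to_windows(aud, num_windows)
        t = align_to_windows(txt, num_windows)
        fused = self.proj_v(v) + self.proj_a(a) + self.proj_t(t)  # (W, d_model)
        return self.encoder(fused.unsqueeze(0))  # (1, W, d_model) context

# Example with typical feature widths (C3D fc-layer features are 4096-d):
vis = torch.randn(120, 4096)  # 120 sampled clips
aud = torch.randn(400, 128)   # 400 audio frames
txt = torch.randn(15, 300)    # 15 OCR token embeddings
ctx = FusionEncoder(4096, 128, 300)(vis, aud, txt)  # (1, 32, 512)
```

The SCN decoder named in the abstract is presumably the Semantic Compositional Network of Gan et al. (2017), in which the LSTM weight matrices are modulated by semantic-tag probabilities through a low-rank factorization. A simplified single-cell version of that factorization (one shared rank for all four gates, rather than per-gate factors) might look like this:

```python
class SCNCell(nn.Module):
    """Simplified SCN-LSTM cell: input and hidden transforms are gated by
    semantic-tag probabilities s via the factorization
    W(s) x = Wc(Wa(x) * Wb(s)). Dimensions are illustrative."""
    def __init__(self, d_word, d_hidden, n_tags, rank=256):
        super().__init__()
        self.Wa = nn.Linear(d_word, rank, bias=False)
        self.Wb = nn.Linear(n_tags, rank, bias=False)
        self.Wc = nn.Linear(rank, 4 * d_hidden, bias=False)
        self.Ua = nn.Linear(d_hidden, rank, bias=False)
        self.Ub = nn.Linear(n_tags, rank, bias=False)
        self.Uc = nn.Linear(rank, 4 * d_hidden)

    def forward(self, x, s, h, c):
        gates = (self.Wc(self.Wa(x) * self.Wb(s))
                 + self.Uc(self.Ua(h) * self.Ub(s)))
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

At generation time the encoder output would condition the initial state (or be attended to at each step) and the cell would be unrolled word by word; those wiring choices are likewise not specified in the record above.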
Pages: 1061 - 1072
Number of pages: 12
Related Papers
50 records in total
  • [1] Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
    Wang, Jingwen
    Jiang, Wenhao
    Ma, Lin
    Liu, Wei
    Xu, Yong
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7190 - 7198
  • [2] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [3] Multimodal Interaction Fusion Network Based on Transformer for Video Captioning
    Xu, Hui
    Zeng, Pengpeng
    Khan, Abdullah Aman
    ARTIFICIAL INTELLIGENCE AND ROBOTICS, ISAIR 2022, PT I, 2022, 1700 : 21 - 36
  • [4] Multimodal feature fusion based on object relation for video captioning
    Yan, Zhiwen
    Chen, Ying
    Song, Jinlong
    Zhu, Jia
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (01) : 247 - 259
  • [5] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Sun, Weichen
    Su, Fei
    Wang, Leiquan
    NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [6] Dense Video Captioning With Early Linguistic Information Fusion
    Aafaq, Nayyer
    Mian, Ajmal
    Akhtar, Naveed
    Liu, Wei
    Shah, Mubarak
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2309 - 2322
  • [7] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [8] Cross-Domain Modality Fusion for Dense Video Captioning
    Aafaq, Nayyer
    Mian, Ajmal
    Liu, Wei
    Akhtar, Naveed
    Shah, Mubarak
    IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2022, 3 (05) : 763 - 777
  • [9] Dense video captioning based on local attention
    Qian, Yong
    Mao, Yingchi
    Chen, Zhihao
    Li, Chang
    Bloh, Olano Teah
    Huang, Qian
    IET IMAGE PROCESSING, 2023, 17 (09) : 2673 - 2685
  • [10] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42