Multimodal Context Fusion Based Dense Video Captioning Algorithm

Cited: 0
Authors
Li, Meiqi [1 ]
Zhou, Ziwei [1 ]
Affiliations
[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China
Keywords
Dense Video Description; Transformer; Multimodal feature fusion; Event context; SCN Decoder
DOI
None
Chinese Library Classification (CLC)
T [Industrial Technology]
Subject Classification Code
08
Abstract
The core task of dense video description is to identify all events occurring in an unedited video and generate textual descriptions for these events. This has applications in fields such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video description models often overlook the role of textual information (e.g., road signs, subtitles) in video comprehension, as well as the contextual relationships between events, which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video description approach based on event-context fusion. The model utilizes a C3D network to extract visual features from the video and integrates OCR technology to extract textual information, thereby enhancing the semantic understanding of the video content. During feature extraction, sliding window and temporal alignment techniques are applied to ensure the temporal consistency of visual, audio, and textual features. A multimodal context fusion encoder is used to capture the temporal and semantic relationships between events and to deeply integrate multimodal features. The SCN decoder then generates descriptions word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets, and its performance is compared with several popular models. Experimental results show significant improvements in CIDEr evaluation scores, achieving 98.8 and 53.7 on the two datasets, respectively. Additionally, ablation studies are conducted to comprehensively assess the effectiveness and stability of each component of the model.
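The abstract's sliding-window temporal alignment step can be illustrated with a minimal sketch: features from modalities sampled at different rates (visual, audio, OCR text) are pooled onto a common set of temporal windows and concatenated before entering the fusion encoder. All function names, dimensions, and the mean-pooling choice below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def align_and_fuse(visual, audio, text, num_windows):
    """Resample each modality's feature sequence onto a shared set of
    temporal windows by mean-pooling, then concatenate along the
    feature axis. A simplified stand-in for the paper's alignment step."""
    def pool(feats, n):
        # Split the frame axis into n contiguous chunks and mean-pool each.
        chunks = np.array_split(np.arange(len(feats)), n)
        return np.stack([feats[idx].mean(axis=0) for idx in chunks])
    pooled = [pool(m, num_windows) for m in (visual, audio, text)]
    # Shape: (num_windows, d_visual + d_audio + d_text)
    return np.concatenate(pooled, axis=-1)

# Toy features sampled at different rates (dimensions are hypothetical).
v = np.random.rand(64, 512)   # e.g. C3D visual features
a = np.random.rand(100, 128)  # audio features
t = np.random.rand(10, 300)   # OCR/text embeddings
fused = align_and_fuse(v, a, t, num_windows=8)
print(fused.shape)  # (8, 940)
```

In practice the fused windows would feed a Transformer-style context encoder; mean-pooling here merely makes the rate mismatch between modalities concrete.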
Pages: 1061-1072
Page count: 12
Related Papers
50 records total
  • [11] Hierarchical Context-aware Network for Dense Video Event Captioning
    Ji, Lei
    Guo, Xianglin
    Huang, Haoyang
    Chen, Xilin
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2004 - 2013
  • [12] Survey of Dense Video Captioning
    Huang, Xiankai
    Zhang, Jiayu
    Wang, Xinyu
    Wang, Xiaochuan
    Liu, Ruijun
Computer Engineering and Applications, 2023, 59 (12): 28-48
  • [13] Multirate Multimodal Video Captioning
    Yang, Ziwei
    Xu, Youjiang
    Wang, Huiyun
    Wang, Bo
    Han, Yahong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1877 - 1882
  • [14] Streamlined Dense Video Captioning
    Mun, Jonghwan
    Yang, Linjie
    Ren, Zhou
    Xu, Ning
    Han, Bohyung
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 3581 - +
  • [15] Multimodal attention-based transformer for video captioning
    Hemalatha Munusamy
    Chandra Sekhar C
Applied Intelligence, 2023, 53: 23349-23368
  • [16] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
  • [17] Research on Video Captioning Based on Multifeature Fusion
    Zhao, Hong
    Guo, Lan
    Chen, ZhiWen
    Zheng, HouZe
    Computational Intelligence and Neuroscience, 2022, 2022
  • [18] Research on Video Captioning Based on Multifeature Fusion
    Zhao, Hong
    Guo, Lan
    Chen, ZhiWen
    Zheng, HouZe
    Sun, Le
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2022, 2022
  • [19] Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers
    Selbes, Berkay
    Sert, Mustafa
    PROCEEDINGS OF THE 2ND WORKSHOP ON USER-CENTRIC NARRATIVE SUMMARIZATION OF LONG VIDEOS, NARSUM 2023, 2023, : 51 - 56
  • [20] Deep multimodal embedding for video captioning
    Jin Young Lee
    Multimedia Tools and Applications, 2019, 78 : 31793 - 31805