Multimodal Context Fusion Based Dense Video Captioning Algorithm

Cited by: 0
Authors
Li, Meiqi [1 ]
Zhou, Ziwei [1 ]
Affiliations
[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China
Keywords
Dense Video Description; Transformer; Multimodal feature fusion; Event context; SCN Decoder;
DOI
Not available
CLC Classification
T [Industrial Technology];
Subject Classification Code
08;
Abstract
The core task of dense video description is to identify all events occurring in an unedited video and generate textual descriptions for these events. This has applications in fields such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video description models often overlook the role of textual information (e.g., road signs, subtitles) in video comprehension, as well as the contextual relationships between events, which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video description approach based on event-context fusion. The model utilizes a C3D network to extract visual features from the video and integrates OCR technology to extract textual information, thereby enhancing the semantic understanding of the video content. During feature extraction, sliding window and temporal alignment techniques are applied to ensure the temporal consistency of visual, audio, and textual features. A multimodal context fusion encoder is used to capture the temporal and semantic relationships between events and to deeply integrate multimodal features. The SCN decoder then generates descriptions word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets, and its performance is compared with several popular models. Experimental results show significant improvements in CIDEr evaluation scores, achieving 98.8 and 53.7 on the two datasets, respectively. Additionally, ablation studies are conducted to comprehensively assess the effectiveness and stability of each component of the model.
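The abstract mentions applying sliding-window and temporal-alignment techniques so that visual, audio, and textual features share a consistent timeline before fusion. The paper does not give implementation details, so the following is only a minimal sketch of one plausible realization: per-frame features from each modality (sampled at different rates) are mean-pooled onto a common grid of overlapping time windows and then concatenated. The function names, window parameters, and feature dimensions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def window_pool(feats, times, duration, window=2.0, stride=1.0):
    """Mean-pool per-frame features into overlapping temporal windows.

    feats: (N, D) array of per-frame features for one modality.
    times: (N,) array of frame timestamps in seconds.
    Returns a (num_windows, D) array on a shared window grid, so that
    modalities sampled at different rates become temporally comparable.
    """
    starts = np.arange(0.0, max(duration - window, 0.0) + 1e-9, stride)
    pooled = []
    for s in starts:
        mask = (times >= s) & (times < s + window)
        if mask.any():
            pooled.append(feats[mask].mean(axis=0))
        else:
            # No frames fall in this window (e.g. sparse OCR text hits):
            # pad with zeros to keep the grid aligned across modalities.
            pooled.append(np.zeros(feats.shape[1]))
    return np.stack(pooled)

def fuse_modalities(visual, v_t, audio, a_t, text, t_t, duration):
    """Align three modalities onto one window grid and concatenate them."""
    v = window_pool(visual, v_t, duration)
    a = window_pool(audio, a_t, duration)
    t = window_pool(text, t_t, duration)
    return np.concatenate([v, a, t], axis=1)
```

Because every modality is pooled onto the same window grid, the fused matrix has one row per time window, which is the temporal consistency the encoder needs before modeling relationships between events.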
Pages: 1061-1072 (12 pages)