Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Times Cited: 6
Authors
Song, Peipei [1 ]
Guo, Dan [2 ,3 ,4 ]
Yang, Xun [1 ]
Tang, Shengeng [2 ]
Wang, Meng [2 ,5 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
[2] Hefei Univ Technol HFUT, Sch Comp Sci & Informat Engn, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
[3] Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
[4] Anhui Zhonghuitong Technol Co Ltd, Hefei 230094, Peoples R China
[5] China Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
Keywords
Emotional video captioning; emotion analysis; emotion-fact coordinated optimization
DOI
10.1109/TIP.2024.3359045
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Effectively summarizing and re-expressing video content in natural language, in a human-like fashion, is one of the key topics in multimedia content understanding. Despite good progress in recent years, existing efforts usually overlook the emotions in user-generated videos, making the generated sentences flat and soulless. To fill this research gap, this paper presents a novel emotional video captioning framework in which we design a Vision-based Emotion Interpretation Network to effectively capture the emotions conveyed in videos and describe the visual content in both factual and emotional language. Specifically, we first model the emotion distribution over an open psychological vocabulary to predict the emotional state of a video. Then, guided by the discovered emotional state, we incorporate visual context, textual context, and visual-textual relevance into an aggregated multimodal contextual vector to enhance video captioning. Furthermore, we optimize the network in a new emotion-fact coordinated way that involves two losses, the Emotional Indication Loss and the Factual Contrastive Loss, which penalize the error of emotion prediction and of visual-textual factual relevance, respectively. In other words, we introduce emotional representation learning into an end-to-end video captioning network. Extensive experiments on the public benchmark datasets EmVidCap and EmVidCap-S demonstrate that our method outperforms state-of-the-art methods by a large margin. Quantitative ablation studies and qualitative analyses show that our method effectively captures the emotions in videos and thus generates emotional sentences that interpret the video content.
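The abstract specifies the training objective only at a high level, so the following is a minimal PyTorch sketch of how such an emotion-fact coordinated objective could be wired up. It is an illustration, not the authors' released code: the class name EmotionFactLoss, the multi-label BCE form of the Emotional Indication Loss, the symmetric InfoNCE form of the Factual Contrastive Loss, and the temperature and loss weights are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionFactLoss(nn.Module):
    """Hypothetical coordinated objective: captioning loss plus two
    auxiliary terms, mirroring the abstract's description."""

    def __init__(self, temperature=0.07, w_emo=1.0, w_fact=1.0):
        super().__init__()
        self.temperature = temperature  # assumed InfoNCE temperature
        self.w_emo = w_emo              # assumed weight for emotion term
        self.w_fact = w_fact            # assumed weight for contrastive term

    def forward(self, emo_logits, emo_labels, vis_feat, txt_feat, cap_loss):
        # Emotional Indication Loss: penalizes the error of the emotion
        # distribution predicted over the psychological vocabulary
        # (modeled here as multi-label BCE; emo_labels is a float tensor).
        emo_loss = F.binary_cross_entropy_with_logits(emo_logits, emo_labels)

        # Factual Contrastive Loss: pulls each video embedding toward its
        # own caption embedding and away from the other captions in the
        # batch (symmetric InfoNCE; the paper's exact form may differ).
        vis = F.normalize(vis_feat, dim=-1)
        txt = F.normalize(txt_feat, dim=-1)
        sim = vis @ txt.t() / self.temperature          # (B, B) similarities
        targets = torch.arange(vis.size(0), device=vis.device)
        fact_loss = 0.5 * (F.cross_entropy(sim, targets)
                           + F.cross_entropy(sim.t(), targets))

        # Emotion-fact coordinated optimization: weighted sum with the
        # token-level cross-entropy captioning loss computed elsewhere.
        return cap_loss + self.w_emo * emo_loss + self.w_fact * fact_loss
```

In a training step one would compute the usual token-level captioning cross-entropy, then pass it together with the emotion logits and labels and pooled video and caption embeddings into this module; the loss formulations actually used in the paper may differ from this sketch.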
Pages: 1122-1135
Number of pages: 14
Related Papers
50 records in total
  • [11] A review of vision-based systems for soccer video analysis
    D'Orazio, T.
    Leo, M.
    PATTERN RECOGNITION, 2010, 43 (08) : 2911 - 2926
  • [12] Vision-based Network System for Industrial Applications
    Suesut, Taweepol
    Numsomran, Arjin
    Tipsuwanporn, Vittaya
    PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 12, 2006, 12 : 98 - 102
  • [13] Topology inference for a vision-based sensor network
    Marinakis, D
    Dudek, G
    2ND CANADIAN CONFERENCE ON COMPUTER AND ROBOT VISION, PROCEEDINGS, 2005, : 121 - 128
  • [14] Hybrid Reasoning Network for Video-based Commonsense Captioning
    Yu, Weijiang
    Liang, Jian
    Ji, Lei
    Li, Lu
    Fang, Yuejian
    Xiao, Nong
    Duan, Nan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5213 - 5221
  • [15] Multimodal Interaction Fusion Network Based on Transformer for Video Captioning
    Xu, Hui
    Zeng, Pengpeng
    Khan, Abdullah Aman
    ARTIFICIAL INTELLIGENCE AND ROBOTICS, ISAIR 2022, PT I, 2022, 1700 : 21 - 36
  • [16] Challenges and solutions for vision-based hand gesture interpretation: A review
    Gao, Kun
    Zhang, Haoyang
    Liu, Xiaolong
    Wang, Xinyi
    Xie, Liang
    Ji, Bowen
    Yan, Ye
    Yin, Erwei
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248
  • [17] Hierarchical Modular Network for Video Captioning
    Ye, Hanhua
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Huang, Qingming
    Yang, Ming-Hsuan
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17918 - 17927
  • [18] Semantic Grouping Network for Video Captioning
    Ryu, Hobin
    Kang, Sunghun
    Kang, Haeyong
    Yoo, Chang D.
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2514 - 2522
  • [19] Rethinking Network for Classroom Video Captioning
    Zhu, Mingjian
    Duan, Chenrui
    Yu, Changbin
    TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [20] Vision-based Detection and Tracking of Moving Target in Video Surveillance
    Ahmed, Sabri M. A. A.
    Khalifa, Othman O.
    2014 INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING (ICCCE), 2014, : 16 - 19