Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Cited by: 6
Authors
Song, Peipei [1 ]
Guo, Dan [2 ,3 ,4 ]
Yang, Xun [1 ]
Tang, Shengeng [2 ]
Wang, Meng [2 ,5 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
[2] Hefei Univ Technol HFUT, Sch Comp Sci & Informat Engn, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
[3] Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
[4] Anhui Zhonghuitong Technol Co Ltd, Hefei 230094, Peoples R China
[5] China Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
Keywords
Emotional video captioning; emotion analysis; emotion-fact coordinated optimization
DOI
10.1109/TIP.2024.3359045
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Effectively summarizing and re-expressing video content in natural language, in a more human-like fashion, is one of the key topics in multimedia content understanding. Despite good progress in recent years, existing efforts usually overlook the emotions in user-generated videos, making the generated sentences dull and soulless. To fill this research gap, this paper presents a novel emotional video captioning framework in which we design a Vision-based Emotion Interpretation Network to effectively capture the emotions conveyed in videos and describe the visual content in both factual and emotional language. Specifically, we first model the emotion distribution over an open psychological vocabulary to predict the emotional state of a video. Then, guided by the discovered emotional state, we incorporate visual context, textual context, and visual-textual relevance into an aggregated multimodal contextual vector to enhance video captioning. Furthermore, we optimize the network in a new emotion-fact coordinated manner that involves two losses, an Emotional Indication Loss and a Factual Contrastive Loss, which penalize errors in emotion prediction and in visual-textual factual relevance, respectively. In other words, we innovatively introduce emotional representation learning into an end-to-end video captioning network. Extensive experiments on the public benchmark datasets EmVidCap and EmVidCap-S demonstrate that our method outperforms state-of-the-art methods by a large margin. Quantitative ablation studies and qualitative analyses show that our method effectively captures the emotions in videos and thus generates emotional sentences that interpret the video content.
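The abstract's emotion-fact coordinated optimization combines a captioning objective with two auxiliary losses. Below is a minimal PyTorch-style sketch of one way such an objective could be wired together. The record gives no formulas, so the multi-label form of the Emotional Indication Loss, the symmetric InfoNCE form of the Factual Contrastive Loss, and every name, shape, and weight (alpha, beta) here are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an emotion-fact coordinated objective; all design
# choices below are assumptions made for illustration only.
import torch
import torch.nn.functional as F


def emotional_indication_loss(emotion_logits, emotion_labels):
    # Multi-label assumption: a video may evoke several words from the
    # open psychological vocabulary, so each vocabulary entry gets an
    # independent binary target of shape (batch, vocab_size).
    return F.binary_cross_entropy_with_logits(emotion_logits, emotion_labels)


def factual_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # InfoNCE-style assumption: matched (video, caption) pairs sit on the
    # diagonal as positives; every other pair in the batch is a negative.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric form: align video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def coordinated_loss(caption_nll, emotion_logits, emotion_labels,
                     video_emb, text_emb, alpha=1.0, beta=1.0):
    # Coordinated objective: the usual caption negative log-likelihood plus
    # the two auxiliary terms; alpha and beta are hypothetical weights.
    return (caption_nll
            + alpha * emotional_indication_loss(emotion_logits, emotion_labels)
            + beta * factual_contrastive_loss(video_emb, text_emb))
```

Under these assumptions, the contrastive term keeps generated captions factually grounded in the visual content, while the emotion term supervises the predicted distribution over the psychological vocabulary, matching the coordinated roles the abstract describes.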
Pages: 1122-1135
Number of pages: 14