Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Citations: 6
|
Authors
Song, Peipei [1 ]
Guo, Dan [2 ,3 ,4 ]
Yang, Xun [1 ]
Tang, Shengeng [2 ]
Wang, Meng [2 ,5 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
[2] Hefei Univ Technol HFUT, Sch Comp Sci & Informat Engn, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
[3] Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
[4] Anhui Zhonghuitong Technol Co Ltd, Hefei 230094, Peoples R China
[5] China Inst Artificial Intelligence, Hefei Comprehens Natl Sci Ctr, Hefei 230088, Peoples R China
Keywords
Emotional video captioning; emotion analysis; emotion-fact coordinated optimization;
DOI
10.1109/TIP.2024.3359045
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Effectively summarizing and re-expressing video content in natural language, in a more human-like fashion, is one of the key topics in multimedia content understanding. Despite good progress in recent years, existing efforts usually overlook the emotions in user-generated videos, making the generated sentences dull and soulless. To fill this research gap, this paper presents a novel emotional video captioning framework in which we design a Vision-based Emotion Interpretation Network to effectively capture the emotions conveyed in videos and describe the visual content in both factual and emotional language. Specifically, we first model the emotion distribution over an open psychological vocabulary to predict the emotional state of a video. Then, guided by the discovered emotional state, we incorporate visual context, textual context, and visual-textual relevance into an aggregated multimodal contextual vector to enhance video captioning. Furthermore, we optimize the network in a new emotion-fact coordinated way that involves two losses, an Emotional Indication Loss and a Factual Contrastive Loss, which penalize the error of emotion prediction and of visual-textual factual relevance, respectively. In other words, we innovatively introduce emotional representation learning into an end-to-end video captioning network. Extensive experiments on the public benchmark datasets EmVidCap and EmVidCap-S demonstrate that our method outperforms state-of-the-art methods by a large margin. Quantitative ablation studies and qualitative analyses show that our method effectively captures the emotions in videos and thus generates emotional sentences that interpret the video content.
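The abstract couples a standard captioning objective with two auxiliary terms: an Emotional Indication Loss penalizing errors in the emotion distribution predicted over the psychological vocabulary, and a Factual Contrastive Loss penalizing weak visual-textual factual relevance. The record carries no code, so the following is only a minimal PyTorch sketch of how such an emotion-fact coordinated objective could be wired up; all names, the multi-label BCE choice for the emotion term, the symmetric InfoNCE form of the contrastive term, and the weights lam_e and lam_f are illustrative assumptions, not the authors' formulation.

    import torch
    import torch.nn.functional as F

    def emotion_indication_loss(emotion_logits, emotion_labels):
        # Assumed form: multi-label BCE over the open psychological vocabulary,
        # penalizing the error of the predicted emotion distribution.
        # emotion_labels must be a float tensor of per-word emotion targets.
        return F.binary_cross_entropy_with_logits(emotion_logits, emotion_labels)

    def factual_contrastive_loss(video_emb, text_emb, temperature=0.07):
        # Assumed form: symmetric InfoNCE. Matched video/caption pairs (the
        # diagonal of the similarity matrix) are pulled together and mismatched
        # pairs pushed apart, encouraging visual-textual factual relevance.
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))

    def coordinated_objective(caption_loss, emotion_logits, emotion_labels,
                              video_emb, text_emb, lam_e=1.0, lam_f=1.0):
        # Emotion-fact coordinated optimization: the base captioning loss plus
        # the two auxiliary losses, with hypothetical trade-off weights.
        return (caption_loss
                + lam_e * emotion_indication_loss(emotion_logits, emotion_labels)
                + lam_f * factual_contrastive_loss(video_emb, text_emb))

The symmetric cross-entropy treats both retrieval directions (video-to-text and text-to-video) equally; the actual loss definitions and weighting scheme are given in the paper itself.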
Pages: 1122-1135
Page count: 14