Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Cited by: 9
Authors
Yan, Yichao [1]
Zhuang, Ning [1]
Ni, Bingbing [1]
Zhang, Jian [1]
Xu, Minghao [1]
Zhang, Qiang [1]
Zheng, Zhang [1]
Cheng, Shuo [1]
Tian, Qi [3]
Xu, Yi [1]
Yang, Xiaokang [2]
Zhang, Wenjun [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Univ Texas San Antonio, San Antonio, TX 78249 USA
Funding
National Natural Science Foundation of China
Keywords
Video caption; representation learning; graphCNN; fine-grained; multiple granularity; SEGMENTATION;
DOI
10.1109/TPAMI.2019.2946823
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Generating continuous, highly detailed linguistic descriptions for multi-subject interactive videos has particular applications in automatic narration of team sports. Compared with traditional video captioning, this task is more challenging because it requires simultaneously modeling fine-grained individual actions, uncovering the spatio-temporal dependency structures of frequent group interactions, and accurately mapping these complex interaction details into long, detailed commentary. To explicitly address these challenges, we propose a novel framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), for the fine-grained team sports auto-narrative task. A multi-granular interaction modeling module is proposed to progressively extract interactive actions among subjects, encoding both intra- and inter-team interactions. Based on these multi-granular representations, a multi-granular attention module is developed to consider action/event descriptions at multiple spatio-temporal resolutions. The two modules are integrated seamlessly and work collaboratively to generate the final narrative. In addition, to facilitate reproducible research, we collect a new video dataset from YouTube.com called the Sports Video Narrative (SVN) dataset. It is a novel direction as it contains 6K team sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (e.g., sentences). Furthermore, because previous metrics such as METEOR (used in the coarse-grained video captioning task) do not cope well with the fine-grained sports narrative task, we develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurately the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure. Extensive experiments on our SVN dataset demonstrate the effectiveness of the proposed framework for fine-grained team sports video auto-narrative.
Pages: 666-683
Number of pages: 18