Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Cited by: 9
Authors
Yan, Yichao [1]
Zhuang, Ning [1]
Ni, Bingbing [1]
Zhang, Jian [1]
Xu, Minghao [1]
Zhang, Qiang [1]
Zheng, Zhang [1]
Cheng, Shuo [1]
Tian, Qi [3]
Xu, Yi [1]
Yang, Xiaokang [2]
Zhang, Wenjun [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Univ Texas San Antonio, San Antonio, TX 78249 USA
Funding
National Natural Science Foundation of China
Keywords
Video caption; representation learning; graphCNN; fine-grained; multiple granularity; SEGMENTATION;
DOI
10.1109/TPAMI.2019.2946823
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning to generate continuous, highly detailed linguistic descriptions for multi-subject interactive videos has particular applications in automatic narration of team sports. Compared with traditional video captioning, this task is more challenging because it requires simultaneously modeling fine-grained individual actions, uncovering the spatio-temporal dependency structures of frequent group interactions, and then accurately mapping these complex interaction details into long and detailed commentary. To explicitly address these challenges, we propose a novel framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), for the fine-grained team sports auto-narrative task. A multi-granular interaction modeling module is proposed to progressively extract interactive actions among subjects, encoding both intra- and inter-team interactions. Based on these multi-granular representations, a multi-granular attention module is developed to consider action/event descriptions at multiple spatio-temporal resolutions. The two modules are integrated seamlessly and work collaboratively to generate the final narrative. In addition, to facilitate reproducible research, we collect a new video dataset from YouTube.com called the Sports Video Narrative dataset (SVN). It is novel in that it contains 6K team sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (e.g., sentences). Furthermore, because previous metrics such as METEOR (used in coarse-grained video captioning) do not cope well with the fine-grained sports narrative task, we develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurately the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure.
Extensive experiments on our SVN dataset have demonstrated the effectiveness of the proposed framework for fine-grained team sports video auto-narrative.
Pages: 666-683
Page count: 18