Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Cited by: 9
Authors
Yan, Yichao [1]
Zhuang, Ning [1]
Ni, Bingbing [1]
Zhang, Jian [1]
Xu, Minghao [1]
Zhang, Qiang [1]
Zheng, Zhang [1]
Cheng, Shuo [1]
Tian, Qi [3]
Xu, Yi [1]
Yang, Xiaokang [2]
Zhang, Wenjun [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Univ Texas San Antonio, San Antonio, TX 78249 USA
Funding
National Natural Science Foundation of China
Keywords
Video caption; representation learning; graphCNN; fine-grained; multiple granularity; SEGMENTATION;
DOI
10.1109/TPAMI.2019.2946823
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning to generate continuous linguistic descriptions of multi-subject interactive videos in great detail has particular applications in team sports auto-narrative. In contrast to traditional video captioning, this task is more challenging because it requires simultaneously modeling fine-grained individual actions, uncovering the spatio-temporal dependency structures of frequent group interactions, and then accurately mapping these complex interaction details into long and detailed commentary. To explicitly address these challenges, we propose a novel framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), for the fine-grained team sports auto-narrative task. A multi-granular interaction modeling module is proposed to extract the interactive actions among subjects in a progressive way, encoding both intra- and inter-team interactions. Based on these multi-granular representations, a multi-granular attention module is developed to consider action/event descriptions at multiple spatio-temporal resolutions. Both modules are integrated seamlessly and work collaboratively to generate the final narrative. Meanwhile, to facilitate reproducible research, we collect a new video dataset from YouTube.com called the Sports Video Narrative (SVN) dataset. It is a novel direction as it contains 6K team sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (e.g., sentences). Furthermore, as previous metrics such as METEOR (i.e., used in the coarse-grained video captioning task) do not cope well with the fine-grained sports narrative task, we develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurately the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure. Extensive experiments on our SVN dataset demonstrate the effectiveness of the proposed framework for fine-grained team sports video auto-narrative.
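The abstract names two ideas without giving their details: progressive interaction modeling over a graph of subjects, and attention over representations at several granularities. The toy sketch below illustrates the general pattern only and is not the GLMGIR architecture. The player graph, the mean-aggregation update, the mean pooling per team, and the dot-product attention are all assumptions chosen for illustration, and every name in the code is hypothetical.

```python
import math

def mean_vec(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def propagate(features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    Each node's new feature is the mean of its own feature and its neighbours'.
    (Assumed update rule; the paper's graph operator may differ.)
    """
    neighbours = {node: [node] for node in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return {node: mean_vec([features[m] for m in nbrs])
            for node, nbrs in neighbours.items()}

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, candidates):
    """Dot-product attention: returns the weights and the weighted sum."""
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = softmax(scores)
    context = [sum(w * cand[i] for w, cand in zip(weights, candidates))
               for i in range(len(query))]
    return weights, context

# Hypothetical toy input: four players, two teams, 2-d action features.
players = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "b1": [1.0, 1.0], "b2": [0.0, 0.0]}
teams = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
edges = [("a1", "a2"), ("b1", "b2"), ("a1", "b1")]  # intra- and inter-team links

# Granularity 1: individual players after one interaction step.
fine = propagate(players, edges)
# Granularity 2: one vector per team (mean pooling is an assumption).
team_level = {t: mean_vec([fine[p] for p in members]) for t, members in teams.items()}
# Granularity 3: a single vector for the whole scene.
global_level = mean_vec(list(team_level.values()))

# Attention over all granularities; the query stands in for a decoder state.
candidates = list(fine.values()) + list(team_level.values()) + [global_level]
weights, context = attend([1.0, 0.0], candidates)
```

In a captioning model the query would typically come from the language decoder at each step, so the weights could shift between player-level and scene-level features as the sentence is generated. Here the query is a fixed placeholder.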
Pages: 666-683
Page count: 18