Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Cited by: 9

Authors:

Yan, Yichao [1]

Zhuang, Ning [1]

Ni, Bingbing [1]

Zhang, Jian [1]

Xu, Minghao [1]

Zhang, Qiang [1]

Zheng, Zhang [1]

Cheng, Shuo [1]

Tian, Qi [3]

Xu, Yi [1]

Yang, Xiaokang [2]

Zhang, Wenjun [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Shanghai 200240, Peoples R China

[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China

[3] Univ Texas San Antonio, San Antonio, TX 78249 USA

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2022 / Vol. 44 / No. 02

Funding:

National Natural Science Foundation of China;

Keywords:

Video caption; representation learning; graphCNN; fine-grained; multiple granularity; SEGMENTATION;

DOI:

10.1109/TPAMI.2019.2946823

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning to generate continuous, highly detailed linguistic descriptions for multi-subject interactive videos has particular applications in team sports auto-narrative. Compared with traditional video captioning, this task is more challenging because it requires simultaneously modeling fine-grained individual actions, uncovering the spatio-temporal dependency structures of frequent group interactions, and then accurately mapping these complex interaction details into long and detailed commentary. To explicitly address these challenges, we propose a novel framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), for the fine-grained team sports auto-narrative task. A multi-granular interaction modeling module is proposed to progressively extract interactive actions among subjects, encoding both intra- and inter-team interactions. Based on these multi-granular representations, a multi-granular attention module is developed to consider action/event descriptions at multiple spatio-temporal resolutions. The two modules are integrated seamlessly and work collaboratively to generate the final narrative. To facilitate reproducible research, we also collect a new video dataset from YouTube.com, called the Sports Video Narrative (SVN) dataset. It represents a novel direction, containing 6K team sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (e.g., sentences). Furthermore, because previous metrics such as METEOR (used in the coarse-grained video captioning task) do not cope well with the fine-grained sports narrative task, we develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurately the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure. Extensive experiments on our SVN dataset demonstrate the effectiveness of the proposed framework for fine-grained team sports video auto-narrative.

Pages: 666 - 683

Number of pages: 18

Related records: 50 in total

[21] Element-Centered Multi-granularity Network for Dense Video Captioning
Dane, Xuan
Wang, Guolong
Wu, Xun
Qin, Zheng
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 445 - 459
[22] Fine-Grained Length Controllable Video Captioning With Ordinal Embeddings
Nitta, Tomoya
Fukuzawa, Takumi
Tamaki, Toru
IEEE ACCESS, 2024, 12 : 189667 - 189688
[23] Fine-Grained Question-Level Deception Detection via Graph-Based Learning and Cross-Modal Fusion
Zhang, Huijun
Ding, Yang
Cao, Lei
Wang, Xin
Feng, Ling
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2022, 17 : 2452 - 2467
[24] A multi-granularity parallelism object recognition processor with content-aware fine-grained task scheduling
Park, Junyoung
Hong, Injoon
Kim, Gyeonghoon
Kim, Youchang
Lee, Kyuho
Park, Seongwook
Bong, Kyeongryeol
Yoo, Hoi-Jun
2013 IEEE COOL CHIPS XVI (COOL CHIPS), 2013
[25] iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning
Lin X.
Jin Q.
Chen S.
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2019, 31 (08) : 1350 - 1357
[26] iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning
Lin, Xiaozhu
Jin, Qin
Chen, Shizhe
Song, Yuqing
Zhao, Yida
ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 78 - 88
[27] Coarse Helps Fine: A Multi-Granularity Discriminative Adversarial Network for Fine-Grained Open-Set Domain Adaptation
Li, Jing
Yang, Liu
Wang, Qilong
Hu, Qinghua
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023 : 2675 - 2680
[28] Multi-Granularity Federated Learning by Graph-Partitioning
Dai, Ziming
Zhao, Yunfeng
Qiu, Chao
Wang, Xiaofei
Yao, Haipeng
Niyato, Dusit
IEEE TRANSACTIONS ON CLOUD COMPUTING, 2025, 13 (01) : 18 - 33
[29] Knowledge graph-based multi-granularity tacit design knowledge reuse for product design
Jia, Jia
Zhang, Yingzhong
Saad, Mohamed
JOURNAL OF COMPUTATIONAL DESIGN AND ENGINEERING, 2025, 12 (01) : 53 - 79
[30] Multi-Granularity Contrastive Learning for Graph with Hierarchical Pooling
Liu, Peishuo
Zhou, Cangqi
Liu, Xiao
Zhang, Jing
Li, Qianmu
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT IV, 2023, 14257 : 499 - 511