Action-guided prompt tuning for video grounding

Cited: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
CLC classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video grounding aims to locate the moment of interest that semantically corresponds to a given query. We argue that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing between videos. In the video modality, however, actions, especially salient ones, typically span many frames and thus carry a rich reservoir of informative detail, whereas at the query level each verb is constrained to a single word representation. This discrepancy reveals a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory, yet preceding methods fail to account for this perception process. Considering these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores the latent expansion information of salient verbs in language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we introduce action temporal prediction as an auxiliary task for video grounding, together with a temporal rank loss function that simulates the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
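The record does not give the temporal rank loss formula; the sketch below shows one plausible reading of it, a pairwise margin ranking loss in PyTorch that pushes frame-level action scores inside the annotated moment above scores outside it. All names here (temporal_rank_loss, scores, inside_mask, margin) are hypothetical illustrations, not the authors' implementation.

import torch
import torch.nn.functional as F

def temporal_rank_loss(scores: torch.Tensor,
                       inside_mask: torch.Tensor,
                       margin: float = 0.2) -> torch.Tensor:
    # Hypothetical reading of AGPT's temporal rank loss: frames inside
    # the queried moment should outscore frames outside it by `margin`.
    # scores:      (T,) frame-level action scores from the grounding head
    # inside_mask: (T,) bool, True for frames within the annotated moment
    pos = scores[inside_mask]        # in-moment frame scores
    neg = scores[~inside_mask]       # out-of-moment frame scores
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())  # degenerate clip: no pairs to rank
    # Pairwise hinge over all (positive, negative) score pairs.
    diff = margin - (pos[:, None] - neg[None, :])
    return F.relu(diff).mean()

Such a term would be used as a weighted auxiliary objective alongside the grounding loss, e.g. loss = grounding_loss + 0.1 * temporal_rank_loss(scores, inside_mask), where the 0.1 weight is likewise an assumed value.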
Pages: 10