Action-guided prompt tuning for video grounding

Cited: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
CLC classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video grounding aims to locate the moment of interest that semantically corresponds to a given query. We argue that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing between videos. In the video modality, however, actions, especially salient ones, typically span many frames and thus carry a rich reservoir of informative detail, whereas at the query level each verb is constrained to a single word representation. This discrepancy reveals a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory, yet preceding methods fail to account for this perception process. Considering these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores the latent expansion information of salient verbs in language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we introduce action temporal prediction as an auxiliary task for video grounding, together with a temporal rank loss function that simulates the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
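The record does not give the temporal rank loss formula; the sketch below shows one plausible reading of it, a pairwise margin ranking loss in PyTorch that pushes frame-level action scores inside the annotated moment above scores outside it. All names here (temporal_rank_loss, scores, inside_mask, margin) are hypothetical illustrations, not the authors' implementation.

import torch
import torch.nn.functional as F

def temporal_rank_loss(scores: torch.Tensor,
                       inside_mask: torch.Tensor,
                       margin: float = 0.2) -> torch.Tensor:
    # Hypothetical reading of AGPT's temporal rank loss: frames inside
    # the queried moment should outscore frames outside it by `margin`.
    # scores:      (T,) frame-level action scores from the grounding head
    # inside_mask: (T,) bool, True for frames within the annotated moment
    pos = scores[inside_mask]        # in-moment frame scores
    neg = scores[~inside_mask]       # out-of-moment frame scores
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())  # degenerate clip: no pairs to rank
    # Pairwise hinge over all (positive, negative) score pairs.
    diff = margin - (pos[:, None] - neg[None, :])
    return F.relu(diff).mean()

Such a term would be used as a weighted auxiliary objective alongside the grounding loss, e.g. loss = grounding_loss + 0.1 * temporal_rank_loss(scores, inside_mask), where the 0.1 weight is likewise an assumed value.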
Pages: 10