Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Cited: 0
Authors
Liu, Daizong [1 ,2 ]
Qu, Xiaoye [2 ]
Wang, Yinzhen [3 ]
Di, Xing [4 ]
Zou, Kai [4 ]
Cheng, Yu [5 ]
Xu, Zichuan [6 ]
Zhou, Pan [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Hubei Engn Res Ctr Big Data Secur, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Hubei, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[4] ProtagoLabs Inc, Vienna, VA USA
[5] Microsoft Res, Redmond, WA USA
[6] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing works have achieved decent results on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet), which leverages the semantic information of the whole query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module that extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance for composing activities in the video via a video-based semantic aggregation module. Finally, a foreground attention branch filters out redundant background activities and refines the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.
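The abstract outlines a three-stage pipeline: mine semantic structure from the query set, use it to compose activity features in the video, and attend to foreground frames. As a rough illustration of that idea only (not the authors' implementation), the sketch below clusters query embeddings into semantic centers with k-means, composes a per-frame activity feature by attending over those centers, and scores foreground frames by their agreement with the composed feature. All function names and the specific clustering/attention choices here are assumptions for illustration.

```python
import numpy as np

def kmeans(x, k, iters=10):
    # Cluster query features (N, d) into k semantic centers.
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [x[0]]
    for _ in range(1, k):
        d = ((x[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(0)
    return centers

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ground(video_feats, semantic_centers):
    # Compose a per-frame activity feature by attending over the
    # semantic centers mined from the query set.
    attn = softmax(video_feats @ semantic_centers.T, axis=1)  # (T, k)
    composed = attn @ semantic_centers                        # (T, d)
    # Foreground score: cosine agreement of each frame with its
    # composed activity; background frames score low.
    fg = (video_feats * composed).sum(1) / (
        np.linalg.norm(video_feats, axis=1)
        * np.linalg.norm(composed, axis=1) + 1e-8)
    # Ground the segment as the span between the first and last frame
    # whose foreground score is above average.
    idx = np.flatnonzero(fg > fg.mean())
    return (int(idx[0]), int(idx[-1])) if len(idx) else (0, len(fg) - 1)
```

In the actual model, the features and clusters would be learned end-to-end by a deep network rather than fixed by hand-crafted k-means and cosine scoring.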
Pages: 1683-1691
Page count: 9
Related Papers
50 items in total (items 41-50 shown)
  • [41] Collaborative Debias Strategy for Temporal Sentence Grounding in Video
    Qi, Zhaobo
    Yuan, Yibo
    Ruan, Xiaowen
    Wang, Shuhui
    Zhang, Weigang
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 10972 - 10986
  • [42] Deep Unsupervised Hashing with Selective Semantic Mining
    Zhao, Chuang
    Ling, Hefei
    Shi, Yuxuan
    Zhao, Chengxin
    Chen, Jiazhong
    Cao, Qiang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 96 - 101
  • [43] DEEP UNSUPERVISED HASHING WITH SEMANTIC CONSISTENCY LEARNING
    Zhao, Chuang
    Lu, Shijie
    Ling, Hefei
    Shi, Yuxuan
    Gu, Bo
    Li, Ping
    Cao, Qiang
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1380 - 1384
  • [44] Deep Unsupervised Hashing with Latent Semantic Components
    Lin, Qinghong
    Chen, Xiaojun
    Zhang, Qin
    Cai, Shaotian
    Zhao, Wenzhe
    Wang, Hongfa
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 7488 - 7496
  • [45] Semantic Guided Deep Unsupervised Image Segmentation
    Saha, Sudipan
    Sudhakaran, Swathikiran
    Banerjee, Biplab
    Pendurkar, Sumedh
    IMAGE ANALYSIS AND PROCESSING - ICIAP 2019, PT II, 2019, 11752 : 499 - 510
  • [46] Unsupervised learning of visual and semantic features for video summarization
    Huang, Yansen
    Zhong, Rui
    Yao, Wenjin
    Wang, Rui
    2021 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2021
  • [47] Semantic-Aware Contrastive Learning With Proposal Suppression for Video Semantic Role Grounding
    Liu, Meng
    Zhou, Di
    Guo, Jie
    Luo, Xin
    Gao, Zan
    Nie, Liqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (04) : 3003 - 3016
  • [48] Dense video captioning using unsupervised semantic information
    Estevam, Valter
    Laroca, Rayson
    Pedrini, Helio
    Menotti, David
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, 107
  • [49] Unsupervised mining of statistical temporal structures in video
    Xie, LX
    Chang, SF
    Divakaran, A
    Sun, HF
    VIDEO MINING, 2003, 6 : 279 - 307
  • [50] Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
    Yuan, Yitian
    Ma, Lin
    Wang, Jingwen
    Liu, Wei
    Zhu, Wenwu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (05) : 2725 - 2741