Dense video captioning using unsupervised semantic information

Cited by: 0
Authors
Estevam, Valter [1 ,2 ]
Laroca, Rayson [2 ,3 ]
Pedrini, Helio [4 ]
Menotti, David [2 ]
Affiliations
[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil
[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil
[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil
[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil
Keywords
Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention;
DOI
10.1016/j.jvcir.2024.104385
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group clip representations, producing a visual codebook. We then learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in scenarios where only visual features are available. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, concatenating the visual representation with our descriptor in a vanilla transformer method achieves state-of-the-art performance in the captioning subtask compared to methods that use only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
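The two-stage idea in the abstract — cluster clip features into a visual codebook, then derive a dense descriptor from how codebook entries co-occur within videos — can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); the function names, the naive k-means, and the row-normalized co-occurrence matrix are illustrative assumptions standing in for the clustering and co-occurrence estimation steps:

```python
import numpy as np

def build_codebook(features, k, iters=10, seed=0):
    """Naive k-means over clip-level features; each centroid is one
    visual 'word' in the codebook. Returns centroids and the codebook
    label assigned to each clip."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Assign each clip to its nearest centroid.
        dists = ((features[:, None] - centroids[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute each centroid as the mean of its assigned clips.
        for c in range(k):
            if (labels == c).any():
                centroids[c] = features[labels == c].mean(axis=0)
    return centroids, labels

def cooccurrence_probs(video_label_seqs, k):
    """Count how often pairs of codebook entries appear in the same
    video, then row-normalize the counts into co-occurrence
    probabilities. Row i is a dense semantic descriptor for entry i."""
    counts = np.zeros((k, k))
    for seq in video_label_seqs:
        present = np.unique(seq)
        for i in present:
            for j in present:
                if i != j:
                    counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Toy usage: 30 clips with 4-dim features, split across two videos.
feats = np.random.default_rng(1).normal(size=(30, 4))
_, clip_labels = build_codebook(feats, k=4)
P = cooccurrence_probs([clip_labels[:15], clip_labels[15:]], k=4)
# A clip's semantic descriptor is the row for its codebook entry,
# which could then be concatenated with its visual feature vector.
descriptor = P[clip_labels[0]]
```

In the paper's pipeline the descriptor is concatenated with the visual features fed to the captioning transformer; here the row of the probability matrix merely stands in for that learned dense encoding.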
Pages: 10
Related Papers
50 total
  • [41] Video captioning with stacked attention and semantic hard pull
    Rahman, Md Mushfiqur
    Abedin, Thasin
    Prottoy, Khondokar S. S.
    Moshruba, Ayana
    Siddiqui, Fazlul Hasan
    PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
  • [42] Richer Semantic Visual and Language Representation for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Wang, Hanzhang
    Xu, Kaisheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1871 - 1876
  • [43] Semantic Tag Augmented XlanV Model for Video Captioning
    Huang, Yiqing
    Xue, Hongwei
    Chen, Jiansheng
    Ma, Huimin
    Ma, Hongbing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4818 - 4822
  • [44] Audio Captioning with Composition of Acoustic and Semantic Information
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2021, 15 (02) : 143 - 160
  • [45] Unsupervised Semantic Parsing of Video Collections
    Sener, Ozan
    Zamir, Amir R.
    Savarese, Silvio
    Saxena, Ashutosh
    2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4480 - 4488
  • [46] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
  • [47] Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
    Qasim, Iqra
    Horsch, Alexander
    Prasad, Dilip
    ACM COMPUTING SURVEYS, 2025, 57 (06)
  • [48] Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
    Wang, Jingwen
    Jiang, Wenhao
    Ma, Lin
    Liu, Wei
    Xu, Yong
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7190 - 7198
  • [49] Exploiting the local temporal information for video captioning
    Wei, Ran
    Mi, Li
    Hu, Yaosi
    Chen, Zhenzhong
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 67
  • [50] A latent topic-aware network for dense video captioning
    Xu, Tao
    Cui, Yuanyuan
    He, Xinyu
    Liu, Caihua
    IET COMPUTER VISION, 2023, 17 (07) : 795 - 803