Dense video captioning using unsupervised semantic information

Cited: 0
Authors
Estevam, Valter [1 ,2 ]
Laroca, Rayson [2 ,3 ]
Pedrini, Helio [4 ]
Menotti, David [2 ]
Affiliations
[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil
[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil
[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil
[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil
Keywords
Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention
DOI
10.1016/j.jvcir.2024.104385
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group visual representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix over the codebook entries. This representation improves the performance of dense video captioning in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, by concatenating the visual representation with our descriptor in a vanilla transformer, we achieve state-of-the-art performance in the captioning subtask among methods that use only visual features, as well as performance competitive with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
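The two-step pipeline in the abstract (cluster segment features into a visual codebook, then encode a co-occurrence probability matrix over codebook entries) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names `assign_codes` and `build_cooccurrence`, the nearest-centroid assignment, and the row-normalized conditional-probability estimate are all assumptions made for clarity.

```python
def assign_codes(features, centroids):
    """Assign each segment feature vector to its nearest codebook
    centroid (squared L2 distance), yielding discrete codebook ids.
    In the paper's setting the centroids would come from a clustering
    method run over many video segment representations."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist2(f, centroids[k]))
            for f in features]


def build_cooccurrence(video_codes, vocab_size):
    """Estimate a co-occurrence probability matrix for codebook entries.

    video_codes: one list of codebook ids per video (its segments'
    cluster assignments). Entry (i, j) of the returned row-stochastic
    matrix approximates P(code j appears in a video | code i appears),
    i.e. how often simple events j and i share a complex event.
    """
    counts = [[0] * vocab_size for _ in range(vocab_size)]
    for codes in video_codes:
        present = set(codes)
        for i in present:
            for j in present:
                if i != j:
                    counts[i][j] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs
```

Each row of the matrix can then serve as a dense semantic descriptor for its codebook entry, to be concatenated with (or substituted for) another modality's features in a captioning model.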
Pages: 10
Related Papers
50 records
  • [31] Step by Step: A Gradual Approach for Dense Video Captioning
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2023, 11 : 51949 - 51959
  • [32] Post-Attention Modulator for Dense Video Captioning
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Laaksonen, Jorma
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1536 - 1542
  • [33] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [34] Parallel Pathway Dense Video Captioning With Deformable Transformer
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2022, 10 : 129899 - 129910
  • [35] Leveraging auxiliary image descriptions for dense video captioning
    Boran, Emre
    Erdem, Aykut
    Ikizler-Cinbis, Nazli
    Erdem, Erkut
    Madhyastha, Pranava
    Specia, Lucia
    PATTERN RECOGNITION LETTERS, 2021, 146 : 70 - 76
  • [36] From Image Captioning to Video Summary using Deep Recurrent Networks and Unsupervised Segmentation
    Morosanu, Bogdan-Andrei
    Lemnaru, Camelia
    TENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2017), 2018, 10696
  • [37] Dense Unsupervised Learning for Video Segmentation
    Araslanov, Nikita
    Schaub-Meyer, Simone
    Roth, Stefan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [38] Attentive Visual Semantic Specialized Network for Video Captioning
    Perez-Martin, Jesus
    Bustos, Benjamin
    Perez, Jorge
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5767 - 5774
  • [39] Video Captioning Method Based on Semantic Topic Association
    Fu, Yan
    Yang, Ying
    Ye, Ou
    ELECTRONICS, 2025, 14 (05):
  • [40] Structured Encoding Based on Semantic Disambiguation for Video Captioning
    Sun, Bo
    Tian, Jinyu
    Wu, Yong
    Yu, Lunjun
    Tang, Yuanyan
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1032 - 1048