Dense video captioning using unsupervised semantic information

Citations: 0
Authors
Estevam, Valter [1,2]
Laroca, Rayson [2,3]
Pedrini, Helio [4]
Menotti, David [2]
Affiliations
[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil
[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil
[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil
[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil
Keywords
Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention
DOI
10.1016/j.jvcir.2024.104385
CLC Number
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler ones and that these simple events are shared across several complex events. We first employ a clustering method to group visual representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves dense video captioning in scenarios where only visual features are available. For example, we replace the audio stream in the BMT method and produce temporal proposals of comparable quality. Furthermore, by concatenating the visual features with our descriptor in a vanilla transformer method, we achieve state-of-the-art captioning performance among methods that exploit only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
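For illustration, the pipeline the abstract describes (codebook via clustering, co-occurrence probabilities over codebook entries, dense descriptor) can be sketched as below. This is a minimal sketch, not the authors' implementation: the k-means codebook, the fixed temporal co-occurrence window, and the truncated-SVD embedding are illustrative assumptions; see the linked repository for the actual method and configuration.

# Minimal sketch of the codebook + co-occurrence idea from the abstract.
# Assumptions (not from the paper): k-means codebook, window of 5 segments,
# truncated-SVD embedding of the co-occurrence probability matrix.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

def build_codebook(features, k=1000, seed=0):
    # features: (N, D) array of clip/segment descriptors pooled over the corpus
    return MiniBatchKMeans(n_clusters=k, random_state=seed).fit(features)

def cooccurrence_probabilities(assignments_per_video, k, window=5):
    # Count codebook entries that appear within `window` segments of each
    # other, then row-normalize the counts into P(entry_j | entry_i).
    C = np.zeros((k, k))
    for ids in assignments_per_video:  # cluster ids per video, in temporal order
        for i, a in enumerate(ids):
            for b in ids[max(0, i - window): i + window + 1]:
                if a != b:
                    C[a, b] += 1.0
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1.0)

def dense_descriptors(P, dim=128):
    # Encode each codebook entry's co-occurrence row as a dense vector.
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(P)

At caption time, each segment would then be assigned to its nearest codebook entry and the corresponding dense descriptor concatenated with the segment's visual feature (or fed to BMT in place of the audio stream), matching the two uses reported in the abstract.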
Pages: 10
Related Papers
50 records in total
  • [21] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020: 4117-4126
  • [22] TopicDVC: Dense Video Captioning with Topic Guidance
    Chen, Wei
    2024 IEEE 10TH INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD, EDGECOM 2024, 2024: 82-87
  • [23] Global semantic enhancement network for video captioning
    Luo, Xuemei
    Luo, Xiaotong
    Wang, Di
    Liu, Jinhui
    Wan, Bo
    Zhao, Lin
    PATTERN RECOGNITION, 2024, 145
  • [24] Adaptive semantic guidance network for video captioning
    Liu, Yuanyuan
    Zhu, Hong
    Wu, Zhong
    Du, Sen
    Wu, Shuning
    Shi, Jing
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2025, 251
  • [25] Chained semantic generation network for video captioning
    Mao, L.
    Gao, H.
    Yang, D.
    Zhang, R.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2022, 30(24): 3198-3209
  • [26] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019: 1300-1305
  • [27] SEMANTIC LEARNING NETWORK FOR CONTROLLABLE VIDEO CAPTIONING
    Chen, Kaixuan
    Di, Qianji
    Lu, Yang
    Wang, Hanzi
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023: 880-884
  • [28] Discriminative Latent Semantic Graph for Video Captioning
    Bai, Yang
    Wang, Junyan
    Long, Yang
    Hu, Bingzhang
    Song, Yang
    Pagnucco, Maurice
    Guan, Yu
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 3556-3564
  • [29] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531: 495-507
  • [30] Jointly Localizing and Describing Events for Dense Video Captioning
    Li, Yehao
    Yao, Ting
    Pan, Yingwei
    Chao, Hongyang
    Mei, Tao
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 7492-7500