Dense video captioning using unsupervised semantic information

Cited by: 0

Authors:

Estevam, Valter [1,2]

Laroca, Rayson [2,3]

Pedrini, Helio [4]

Menotti, David [2]

Affiliations:

[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil

[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil

[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil

[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil

Source:

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION | 2025 / Volume 107

Keywords:

Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention

DOI:

10.1016/j.jvcir.2024.104385

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group the representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to the methods that explore only visual features, as well as a competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
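
The first two steps named in the abstract (clustering features into a visual codebook, then estimating a co-occurrence probability matrix over codebook entries) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which clustering method, context window, or normalization is used, so the k-means routine, the `window=1` neighborhood, the row normalization, and all function names below are assumptions chosen for illustration. The final step of learning a dense representation from the matrix is omitted.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, point):
    """Index of the codebook entry (centroid) closest to a feature vector."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], point))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's clustering method."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster is empty
                centroids[idx] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids

def cooccurrence_probabilities(sequences, k, window=1):
    """Row-normalized counts of codewords appearing within `window` steps."""
    counts = [[0.0] * k for _ in range(k)]
    for seq in sequences:
        for i, a in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[a][seq[j]] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs

# Toy data (hypothetical): two "videos", each a sequence of 2-D clip features.
videos = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)],
    [(5.0, 4.9), (0.1, 0.1), (0.0, 0.0)],
]
all_points = [p for video in videos for p in video]
centroids = kmeans(all_points, k=2)                       # visual codebook
sequences = [[nearest(centroids, p) for p in video] for video in videos]
probs = cooccurrence_probabilities(sequences, k=2)        # P(neighbor | codeword)

for row in probs:
    print([round(p, 3) for p in row])
```

Each row of `probs` is a conditional distribution over the codewords that appear next to a given codeword, which is the kind of statistic a dense embedding could then be trained to encode.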

Pages: 10
