Dense video captioning using unsupervised semantic information

Cited by: 0

Authors:

Estevam, Valter ^{[1,2]}

Laroca, Rayson ^{[2,3]}

Pedrini, Helio ^{[4]}

Menotti, David ^{[2]}

Affiliations:

[1] Fed Inst Parana, BR-84507302 Irati, PR, Brazil

[2] Univ Fed Parana, Dept Informat, BR-81531970 Curitiba, PR, Brazil

[3] Pontificia Univ Catolica Parana, Postgrad Program Informat, BR-80215901 Curitiba, PR, Brazil

[4] Univ Estadual Campinas, Inst Comp, BR-13083852 Campinas, SP, Brazil

Source:

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION | 2025 / Volume 107

Keywords:

Visual similarity; Unsupervised learning; Co-occurrence estimation; Self-attention; Bi-modal attention

DOI:

10.1016/j.jvcir.2024.104385

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to methods that explore only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
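The first two steps named in the abstract (clustering into a visual codebook, then estimating co-occurrence probabilities between codebook entries) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy k-means, the 2-D features, the window size, and every function name below are assumptions made for illustration, and the final step of learning a dense representation from the matrix is omitted.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], point))


def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means (assumed stand-in for the clustering step); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def cooccurrence_probabilities(sequences, k, window=1):
    """Row-normalised counts of codeword pairs seen within +/- window steps."""
    counts = [[0.0] * k for _ in range(k)]
    for seq in sequences:
        for i, a in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[a][seq[j]] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Two toy "videos", each a sequence of 2-D clip features (made-up numbers).
videos = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)],
    [(5.0, 4.9), (0.0, 0.0), (0.1, 0.1), (4.9, 5.0)],
]
all_clips = [clip for video in videos for clip in video]
codebook = kmeans(all_clips, k=2)                       # visual codebook
codes = [[nearest(codebook, clip) for clip in video] for video in videos]
probs = cooccurrence_probabilities(codes, k=2, window=1)
```

Here each row of `probs` is an estimate of how likely each codebook entry is to appear near a given entry; a dense descriptor would then be learned from this matrix, which the sketch does not attempt.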

Pages: 10
