Joint learning of images and videos with a single Vision Transformer

Citations: 0

Authors:

Shimizu, Shuki [1]

Tamaki, Toru [1]

Affiliations:

[1] Nagoya Inst Technol, Nagoya, Japan

Source:

2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA | 2023

Keywords:

DOI:

10.23919/MVA57639.2023.10215661

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually handled by separately trained models. The proposed method, IV-ViT, feeds a batch of images to a Vision Transformer and also accepts a set of video frames, with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
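The idea of sharing one frame-level model between images and videos can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state which fusion operator is used, so averaging per-frame class scores is an assumption here, and `toy_model`, `forward_images`, `forward_videos`, and `late_fusion` are hypothetical names. Plain Python lists stand in for image tensors.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for one image or one video frame
Logits = List[float]      # per-class scores
Model = Callable[[Frame], Logits]


def late_fusion(per_frame_logits: List[Logits]) -> Logits:
    """Average class scores over the frames of one clip (assumed operator)."""
    num_frames = len(per_frame_logits)
    num_classes = len(per_frame_logits[0])
    return [
        sum(frame[c] for frame in per_frame_logits) / num_frames
        for c in range(num_classes)
    ]


def forward_images(model: Model, images: List[Frame]) -> List[Logits]:
    """Image path: every image in the batch goes through the shared model."""
    return [model(image) for image in images]


def forward_videos(model: Model, videos: List[List[Frame]]) -> List[Logits]:
    """Video path: flatten clips into frames, reuse the model, then fuse."""
    flat_frames = [frame for clip in videos for frame in clip]
    flat_logits = [model(frame) for frame in flat_frames]
    fused, start = [], 0
    for clip in videos:
        fused.append(late_fusion(flat_logits[start:start + len(clip)]))
        start += len(clip)
    return fused


def toy_model(frame: Frame) -> Logits:
    """Hypothetical two-class scorer standing in for a Vision Transformer."""
    total = sum(frame)
    return [total, -total]


image_scores = forward_images(toy_model, [[1.0, 2.0]])
video_scores = forward_videos(toy_model, [[[1.0, 1.0], [3.0, 1.0]]])
```

Because fusion happens only after the per-frame scores are computed, the same model weights serve both input types; only the video path adds the aggregation step.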


Pages: 6
