Joint learning of images and videos with a single Vision Transformer

Cited by: 0
Authors
Shimizu, Shuki [1]
Tamaki, Toru [1]
Affiliations
[1] Nagoya Inst Technol, Nagoya, Japan
DOI
10.23919/MVA57639.2023.10215661
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually handled by separately trained models. The proposed method, IV-ViT, feeds a Vision Transformer either a batch of images or a set of video frames, and aggregates the video frames temporally by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
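The abstract does not give implementation details, so the following is only a minimal sketch of what sharing one model between images and videos with late fusion could look like. It assumes that late fusion means averaging per-frame outputs over time, and it uses a random linear layer as a stand-in for the Vision Transformer. The names frame_model, forward, and NUM_CLASSES, and all tensor sizes, are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 5           # hypothetical label count
C, H, W = 3, 8, 8         # tiny frames so the sketch runs instantly

# Stand-in for the shared Vision Transformer: a single linear layer.
# A real model would replace this; the weights are random placeholders.
WEIGHTS = rng.standard_normal((C * H * W, NUM_CLASSES))


def frame_model(x):
    """Map a batch of frames (N, C, H, W) to per-frame logits (N, NUM_CLASSES)."""
    return x.reshape(x.shape[0], -1) @ WEIGHTS


def forward(x):
    """Run images (B, C, H, W) or videos (B, T, C, H, W) through one model."""
    if x.ndim == 4:
        # Images: the batch goes through the model directly.
        return frame_model(x)
    if x.ndim == 5:
        b, t = x.shape[:2]
        # Videos: fold the time axis into the batch axis so every frame
        # is processed exactly like an image.
        frames = x.reshape(b * t, *x.shape[2:])
        logits = frame_model(frames).reshape(b, t, -1)
        # Late fusion (assumed here to be a mean over frames).
        return logits.mean(axis=1)
    raise ValueError(f"expected 4 or 5 dimensions, got {x.ndim}")


images = rng.standard_normal((4, C, H, W))
videos = rng.standard_normal((2, 6, C, H, W))
print(forward(images).shape)  # (4, 5)
print(forward(videos).shape)  # (2, 5)
```

Because the time axis is folded into the batch axis, the same weights serve both inputs, and a video made of identical frames yields the same output as the corresponding single image.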
Pages: 6