Joint learning of images and videos with a single Vision Transformer

Cited by: 0
Authors
Shimizu, Shuki [1]
Tamaki, Toru [1]
Affiliations
[1] Nagoya Institute of Technology, Nagoya, Japan
DOI
10.23919/MVA57639.2023.10215661
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this study, we propose a method for jointly learning images and videos with a single model. Images and videos are usually handled by separate models. The proposed method (IV-ViT) feeds a batch of images to a Vision Transformer and also accepts a set of video frames, with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
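The idea in the abstract, one model that serves both images and video frames with late fusion over time, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_frame_model` is a hypothetical stand-in for the Vision Transformer, frames are toy lists of floats, and averaging per-frame scores is only one possible form of late fusion (the abstract does not say which aggregation is used).

```python
from typing import Callable, List

Frame = List[float]    # toy stand-in for one image or one video frame
Logits = List[float]   # class scores produced for one frame


def toy_frame_model(frame: Frame) -> Logits:
    """Hypothetical stand-in for a ViT: maps one frame to two class scores."""
    total = sum(frame)
    return [total, -total]


def classify_images(model: Callable[[Frame], Logits],
                    images: List[Frame]) -> List[Logits]:
    """Image path: every image in the batch goes through the shared model."""
    return [model(image) for image in images]


def classify_videos(model: Callable[[Frame], Logits],
                    videos: List[List[Frame]]) -> List[Logits]:
    """Video path: frames are treated as a batch of images, then fused late.

    Assumption: late fusion is done by averaging per-frame scores per clip.
    """
    # Flatten all clips into one batch of frames and reuse the image path.
    frames = [frame for clip in videos for frame in clip]
    frame_logits = classify_images(model, frames)

    fused: List[Logits] = []
    start = 0
    for clip in videos:
        num_frames = len(clip)
        clip_logits = frame_logits[start:start + num_frames]
        start += num_frames
        num_classes = len(clip_logits[0])
        # Temporal aggregation happens only after per-frame prediction.
        fused.append([
            sum(logits[c] for logits in clip_logits) / num_frames
            for c in range(num_classes)
        ])
    return fused


image_scores = classify_images(toy_frame_model, [[1.0, 2.0], [0.5, 0.5]])
video_scores = classify_videos(
    toy_frame_model,
    [[[1.0, 1.0], [3.0, 1.0]], [[0.0, 0.0]]],
)
```

In this sketch the same function handles both modalities, so the only video-specific step is the averaging at the end; a real implementation would replace the toy model with an actual Vision Transformer and tensors.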
Pages: 6