Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation

Cited by: 23
Authors
Park, Soyeon [1 ]
Kim, Bo-Kyeong [2 ]
Dong, Suh-Yeon [1 ]
Affiliations
[1] Sookmyung Womens Univ, HCI Lab IT Engn, Seoul 04310, South Korea
[2] Nota Inc, Seoul 06212, South Korea
Funding
National Research Foundation of Singapore
Keywords
Near-infrared (NIR); remote heart rate (HR) measurement; remote photoplethysmography (rPPG); RGB; self-supervised learning (SSL); video vision transformer (ViViT); HEART-RATE ESTIMATION; PHOTOPLETHYSMOGRAPHY;
DOI
10.1109/TIM.2022.3217867
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Remote photoplethysmography (rPPG) is a technology that estimates heart rate (HR) contactlessly from facial videos. Because rPPG signal estimation is low-cost, it is widely used for noncontact health monitoring. Recent rPPG-based HR estimation studies rely heavily on supervised feature learning from ordinary RGB videos. However, RGB-only methods are significantly affected by head movements and varying illumination conditions, and the large-scale labeled data on which the performance of supervised learning methods depends is difficult to obtain for rPPG. To address these problems, we present the first self-supervised, transformer-based fusion learning framework of its kind for rPPG estimation. We propose an end-to-end fusion video vision transformer (Fusion ViViT) network that extracts long-range local and global spatiotemporal features from videos and converts them into video sequences to enhance the rPPG representation. In addition, the transformer's self-attention integrates the spatiotemporal representations of the complementary RGB and near-infrared (NIR) modalities, which in turn enables robust HR estimation even under complex conditions. We use contrastive learning as our self-supervised learning (SSL) scheme. We evaluate the framework on public datasets containing both RGB and NIR videos together with physiological signals. On a large-scale rPPG dataset covering various scenarios, near-instant HR estimation (approximately 6-s windows) achieved a root mean squared error (RMSE) of 14.86, competitive with state-of-the-art accuracy for average HR (approximately 30-s windows). Furthermore, transfer learning on a driving rPPG dataset showed stable HR estimation with an RMSE of 16.94, demonstrating that our framework can be used in the real world.
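The abstract names two mechanisms: RGB-NIR fusion through transformer self-attention and a contrastive SSL objective. Below is a minimal PyTorch sketch of both ideas under stated assumptions; FusionBlock, info_nce, and all shapes, pooling, and augmentation choices are illustrative inventions, not the paper's actual Fusion ViViT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Joint self-attention over concatenated RGB and NIR token sequences.

    Hypothetical simplification: each modality is tokenized separately,
    the token sequences are concatenated along the sequence axis, and a
    standard transformer encoder layer mixes them, so attention heads can
    attend across modalities.
    """

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, rgb_tokens, nir_tokens):
        # (B, T_rgb + T_nir, dim): one sequence, cross-modal mixing for free
        fused = torch.cat([rgb_tokens, nir_tokens], dim=1)
        return self.encoder(fused)


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss (a common SSL objective; the paper's exact
    formulation may differ): matched clip embeddings are positives, all
    other pairs in the batch are negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


# Toy usage: 8 clips, each tokenized into 32 RGB and 32 NIR tokens of width 128.
block = FusionBlock()
rgb = torch.randn(8, 32, 128)
nir = torch.randn(8, 32, 128)

# Two "views" of the same clips (a trivial temporal flip stands in for real
# augmentations); mean-pool fused tokens into per-clip embeddings.
view1 = block(rgb, nir).mean(dim=1)                                    # (8, 128)
view2 = block(torch.flip(rgb, [1]), torch.flip(nir, [1])).mean(dim=1)  # (8, 128)
loss = info_nce(view1, view2)
print(loss.item())
```

Concatenating the token sequences lets plain self-attention handle the RGB-NIR fusion without a bespoke cross-attention module; the paper's actual tokenization (ViViT-style tubelet embedding of video) and its contrastive objective may be configured differently.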
Pages: 10
Related Papers
50 records in total
  • [1] Self-supervised Video Transformer
    Ranasinghe, Kanchana
    Naseer, Muzammal
    Khan, Salman
    Khan, Fahad Shahbaz
    Ryoo, Michael S.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 2864 - 2874
  • [2] MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Zhao, Chaoqiang
    Zhang, Youmin
    Poggi, Matteo
    Tosi, Fabio
    Guo, Xianda
    Zhu, Zheng
    Huang, Guan
    Tang, Yang
    Mattoccia, Stefano
    2022 INTERNATIONAL CONFERENCE ON 3D VISION, 3DV, 2022, : 668 - 678
  • [3] Positional Label for Self-Supervised Vision Transformer
    Zhang, Zhemin
    Gong, Xun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3516 - 3524
  • [4] Geometrized Transformer for Self-Supervised Homography Estimation
    Liu, Jiazhen
    Li, Xirong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 9522 - 9531
  • [5] Multimodal Image Fusion via Self-Supervised Transformer
    Zhang, Jing
    Liu, Yu
    Liu, Aiping
    Xie, Qingguo
    Ward, Rabab
    Wang, Z. Jane
    Chen, Xun
    IEEE SENSORS JOURNAL, 2023, 23 (09) : 9796 - 9807
  • [6] Self-Supervised Video-Centralised Transformer for Video Face Clustering
    Wang, Yujiang
    Dong, Mingzhi
    Shen, Jie
    Luo, Yiming
    Lin, Yiming
    Ma, Pingchuan
    Petridis, Stavros
    Pantic, Maja
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12944 - 12959
  • [7] Self-supervised multimodal fusion transformer for passive activity recognition
    Koupai, Armand K.
    Bocus, Mohammud J.
    Santos-Rodriguez, Raul
    Piechocki, Robert J.
    McConville, Ryan
    IET WIRELESS SENSOR SYSTEMS, 2022, 12 (5-6) : 149 - 160
  • [8] STFNet: Self-Supervised Transformer for Infrared and Visible Image Fusion
    Liu, Qiao
    Pi, Jiatian
    Gao, Peng
    Yuan, Di
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1513 - 1526
  • [9] TFDEPTH: SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION WITH MULTI-SCALE SELECTIVE TRANSFORMER FEATURE FUSION
    Hu, Hongli
    Miao, Jun
    Zhu, Guanghu
    Yan, Je
    Chu, Jun
    IMAGE ANALYSIS & STEREOLOGY, 2024, 43 (02): : 139 - 149
  • [10] A Self-Supervised Decision Fusion Framework for Building Detection
    Senaras, Caglar
    Vural, Fatos T. Yarman
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2016, 9 (05) : 1780 - 1791