Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation

Cited by: 23
Authors
Park, Soyeon [1 ]
Kim, Bo-Kyeong [2 ]
Dong, Suh-Yeon [1 ]
Affiliations
[1] Sookmyung Womens Univ, HCI Lab IT Engn, Seoul 04310, South Korea
[2] Nota Inc, Seoul 06212, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Near-infrared (NIR); remote heart rate (HR) measurement; remote photoplethysmography (rPPG); RGB; self-supervised learning (SSL); video vision transformer (ViViT); HEART-RATE ESTIMATION; PHOTOPLETHYSMOGRAPHY;
DOI
10.1109/TIM.2022.3217867
CLC (Chinese Library Classification)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Remote photoplethysmography (rPPG) is a technology that estimates heart rate (HR) without contact from facial videos. Because rPPG signals can be estimated at low cost, the technique is widely used for noncontact health monitoring. Recent rPPG-based HR estimation studies rely heavily on supervised feature learning from ordinary RGB videos. However, RGB-only methods are significantly affected by head movements and varying illumination conditions, and the large-scale labeled rPPG data that supervised learning methods need to reach their full performance are difficult to obtain. To address these problems, we present a first-of-its-kind self-supervised transformer-based fusion learning framework for rPPG estimation. We propose an end-to-end fusion video vision transformer (Fusion ViViT) network that extracts long-range local and global spatiotemporal features from videos and converts them into video sequences to enhance the rPPG representation. In addition, the self-attention of the transformer integrates the spatiotemporal representations of complementary RGB and near-infrared (NIR) streams, which in turn enables robust HR estimation even under complex conditions. We use contrastive learning as the self-supervised learning (SSL) scheme. We evaluate our framework on public datasets containing both RGB and NIR videos along with physiological signals. Near-instant HR (approximately 6 s) estimation on a large-scale rPPG dataset with various scenarios yielded a root mean squared error (RMSE) of 14.86, which is competitive with the state-of-the-art accuracy for average HR (approximately 30 s). Furthermore, transfer learning on a driving rPPG dataset showed stable HR estimation performance with an RMSE of 16.94, demonstrating that our framework can be utilized in the real world.
Pages: 10
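
The RGB-NIR fusion via transformer self-attention and the contrastive SSL objective described in the abstract can be pictured with a minimal sketch. The PyTorch code below is illustrative only and is not the authors' implementation: the FusionBlock module, the token dimensions, the InfoNCE-style loss, and all hyperparameters are assumptions chosen to show how a joint RGB-NIR token sequence can be mixed by self-attention and trained contrastively.

# Illustrative sketch only (not the authors' released code): fusing RGB and NIR
# token sequences with transformer self-attention and training with a
# contrastive (InfoNCE-style) objective. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Concatenate RGB and NIR tokens and let self-attention mix the modalities."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, nir_tokens):
        # rgb_tokens, nir_tokens: (batch, frames, dim) spatiotemporal tokens
        tokens = torch.cat([rgb_tokens, nir_tokens], dim=1)  # joint RGB-NIR sequence
        attended, _ = self.attn(tokens, tokens, tokens)      # cross-modal mixing
        fused = self.norm(tokens + attended)                 # residual + layer norm
        return fused.mean(dim=1)                             # pooled clip embedding


def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: two views of the same clip are positives, others negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, frames, dim = 8, 180, 128          # e.g. roughly 6 s of video at 30 fps
    fusion = FusionBlock(dim)
    # Tokens would come from a ViViT-style backbone; random tensors stand in here.
    rgb_a, nir_a = torch.randn(batch, frames, dim), torch.randn(batch, frames, dim)
    rgb_b, nir_b = torch.randn(batch, frames, dim), torch.randn(batch, frames, dim)
    loss = info_nce(fusion(rgb_a, nir_a), fusion(rgb_b, nir_b))
    print(f"contrastive loss: {loss.item():.3f}")

In this sketch, the two "views" passed to the loss would in practice be differently augmented versions of the same RGB-NIR clip, so that clips from the same recording are pulled together in embedding space while clips from different recordings are pushed apart.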