Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation

Cited by: 23
Authors
Park, Soyeon [1 ]
Kim, Bo-Kyeong [2 ]
Dong, Suh-Yeon [1 ]
Affiliations
[1] Sookmyung Womens Univ, HCI Lab IT Engn, Seoul 04310, South Korea
[2] Nota Inc, Seoul 06212, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Near-infrared (NIR); remote heart rate (HR) measurement; remote photoplethysmography (rPPG); RGB; self-supervised learning (SSL); video vision transformer (ViViT); HEART-RATE ESTIMATION; PHOTOPLETHYSMOGRAPHY;
DOI
10.1109/TIM.2022.3217867
CLC (Chinese Library Classification)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Remote photoplethysmography (rPPG) is a technology that estimates heart rate (HR) without contact from facial videos. Because rPPG signals can be estimated at low cost, the technique is widely used for noncontact health monitoring. Recent rPPG-based HR estimation studies rely heavily on supervised feature learning from ordinary RGB videos. However, RGB-only methods are significantly affected by head movements and varying illumination conditions, and the large-scale labeled rPPG data that supervised learning methods need to reach their full performance are difficult to obtain. To address these problems, we present a first-of-its-kind self-supervised transformer-based fusion learning framework for rPPG estimation. We propose an end-to-end fusion video vision transformer (Fusion ViViT) network that extracts long-range local and global spatiotemporal features from videos and converts them into video sequences to enhance the rPPG representation. In addition, the self-attention of the transformer integrates the spatiotemporal representations of complementary RGB and near-infrared (NIR) streams, which in turn enables robust HR estimation even under complex conditions. We use contrastive learning as the self-supervised learning (SSL) scheme. We evaluate our framework on public datasets containing both RGB and NIR videos along with physiological signals. Near-instant HR (approximately 6 s) estimation on a large-scale rPPG dataset with various scenarios yielded a root mean squared error (RMSE) of 14.86, which is competitive with the state-of-the-art accuracy for average HR (approximately 30 s). Furthermore, transfer learning on a driving rPPG dataset showed stable HR estimation performance with an RMSE of 16.94, demonstrating that our framework can be utilized in the real world.
Pages: 10
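
The RGB-NIR fusion via transformer self-attention and the contrastive SSL objective described in the abstract can be pictured with a minimal sketch. The PyTorch code below is illustrative only and is not the authors' implementation: the FusionBlock module, the token dimensions, the InfoNCE-style loss, and all hyperparameters are assumptions chosen to show how a joint RGB-NIR token sequence can be mixed by self-attention and trained contrastively.

# Illustrative sketch only (not the authors' released code): fusing RGB and NIR
# token sequences with transformer self-attention and training with a
# contrastive (InfoNCE-style) objective. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Concatenate RGB and NIR tokens and let self-attention mix the modalities."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, nir_tokens):
        # rgb_tokens, nir_tokens: (batch, frames, dim) spatiotemporal tokens
        tokens = torch.cat([rgb_tokens, nir_tokens], dim=1)  # joint RGB-NIR sequence
        attended, _ = self.attn(tokens, tokens, tokens)      # cross-modal mixing
        fused = self.norm(tokens + attended)                 # residual + layer norm
        return fused.mean(dim=1)                             # pooled clip embedding


def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: two views of the same clip are positives, others negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, frames, dim = 8, 180, 128          # e.g. roughly 6 s of video at 30 fps
    fusion = FusionBlock(dim)
    # Tokens would come from a ViViT-style backbone; random tensors stand in here.
    rgb_a, nir_a = torch.randn(batch, frames, dim), torch.randn(batch, frames, dim)
    rgb_b, nir_b = torch.randn(batch, frames, dim), torch.randn(batch, frames, dim)
    loss = info_nce(fusion(rgb_a, nir_a), fusion(rgb_b, nir_b))
    print(f"contrastive loss: {loss.item():.3f}")

In this sketch, the two "views" passed to the loss would in practice be differently augmented versions of the same RGB-NIR clip, so that clips from the same recording are pulled together in embedding space while clips from different recordings are pushed apart.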