Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation

Cited by: 23
Authors
Park, Soyeon [1 ]
Kim, Bo-Kyeong [2 ]
Dong, Suh-Yeon [1 ]
Affiliations
[1] Sookmyung Womens Univ, HCI Lab IT Engn, Seoul 04310, South Korea
[2] Nota Inc, Seoul 06212, South Korea
Funding
National Research Foundation of Singapore
Keywords
Near-infrared (NIR); remote heart rate (HR) measurement; remote photoplethysmography (rPPG); RGB; self-supervised learning (SSL); video vision transformer (ViViT); HEART-RATE ESTIMATION; PHOTOPLETHYSMOGRAPHY;
DOI
10.1109/TIM.2022.3217867
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Remote photoplethysmography (rPPG) is a technology that estimates heart rate (HR) contactlessly from facial videos. Because rPPG signal estimation is low-cost, it is widely used for noncontact health monitoring. Recent rPPG-based HR estimation studies rely heavily on supervised feature learning from ordinary RGB videos. However, RGB-only methods are significantly affected by head movements and varying illumination conditions, and the large-scale labeled data on which the performance of supervised learning methods depends is difficult to obtain for rPPG. To address these problems, we present the first self-supervised, transformer-based fusion learning framework of its kind for rPPG estimation. We propose an end-to-end fusion video vision transformer (Fusion ViViT) network that extracts long-range local and global spatiotemporal features from videos and converts them into video sequences to enhance the rPPG representation. In addition, the transformer's self-attention integrates the spatiotemporal representations of the complementary RGB and near-infrared (NIR) modalities, which in turn enables robust HR estimation even under complex conditions. We use contrastive learning as our self-supervised learning (SSL) scheme. We evaluate the framework on public datasets containing both RGB and NIR videos together with physiological signals. On a large-scale rPPG dataset covering various scenarios, near-instant HR estimation (approximately 6-s windows) achieved a root mean squared error (RMSE) of 14.86, competitive with state-of-the-art accuracy for average HR (approximately 30-s windows). Furthermore, transfer learning on a driving rPPG dataset showed stable HR estimation with an RMSE of 16.94, demonstrating that our framework can be used in the real world.
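The abstract names two mechanisms: RGB-NIR fusion through transformer self-attention and a contrastive SSL objective. Below is a minimal PyTorch sketch of both ideas under stated assumptions; FusionBlock, info_nce, and all shapes, pooling, and augmentation choices are illustrative inventions, not the paper's actual Fusion ViViT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Joint self-attention over concatenated RGB and NIR token sequences.

    Hypothetical simplification: each modality is tokenized separately,
    the token sequences are concatenated along the sequence axis, and a
    standard transformer encoder layer mixes them, so attention heads can
    attend across modalities.
    """

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, rgb_tokens, nir_tokens):
        # (B, T_rgb + T_nir, dim): one sequence, cross-modal mixing for free
        fused = torch.cat([rgb_tokens, nir_tokens], dim=1)
        return self.encoder(fused)


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss (a common SSL objective; the paper's exact
    formulation may differ): matched clip embeddings are positives, all
    other pairs in the batch are negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


# Toy usage: 8 clips, each tokenized into 32 RGB and 32 NIR tokens of width 128.
block = FusionBlock()
rgb = torch.randn(8, 32, 128)
nir = torch.randn(8, 32, 128)

# Two "views" of the same clips (a trivial temporal flip stands in for real
# augmentations); mean-pool fused tokens into per-clip embeddings.
view1 = block(rgb, nir).mean(dim=1)                                    # (8, 128)
view2 = block(torch.flip(rgb, [1]), torch.flip(nir, [1])).mean(dim=1)  # (8, 128)
loss = info_nce(view1, view2)
print(loss.item())
```

Concatenating the token sequences lets plain self-attention handle the RGB-NIR fusion without a bespoke cross-attention module; the paper's actual tokenization (ViViT-style tubelet embedding of video) and its contrastive objective may be configured differently.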
Pages: 10
Related Papers
50 records in total
  • [1] Self-supervised Video Transformer
    Ranasinghe, Kanchana
    Naseer, Muzammal
    Khan, Salman
    Khan, Fahad Shahbaz
    Ryoo, Michael S.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 2864 - 2874
  • [2] MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
    Zhao, Chaoqiang
    Zhang, Youmin
    Poggi, Matteo
    Tosi, Fabio
    Guo, Xianda
    Zhu, Zheng
    Huang, Guan
    Tang, Yang
    Mattoccia, Stefano
    2022 INTERNATIONAL CONFERENCE ON 3D VISION, 3DV, 2022, : 668 - 678
  • [3] Positional Label for Self-Supervised Vision Transformer
    Zhang, Zhemin
    Gong, Xun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3516 - 3524
  • [4] Geometrized Transformer for Self-Supervised Homography Estimation
    Liu, Jiazhen
    Li, Xirong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 9522 - 9531
  • [5] Multimodal Image Fusion via Self-Supervised Transformer
    Zhang, Jing
    Liu, Yu
    Liu, Aiping
    Xie, Qingguo
    Ward, Rabab
    Wang, Z. Jane
    Chen, Xun
    IEEE SENSORS JOURNAL, 2023, 23 (09) : 9796 - 9807
  • [6] Self-Supervised Video-Centralised Transformer for Video Face Clustering
    Wang, Yujiang
    Dong, Mingzhi
    Shen, Jie
    Luo, Yiming
    Lin, Yiming
    Ma, Pingchuan
    Petridis, Stavros
    Pantic, Maja
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12944 - 12959
  • [7] Self-supervised multimodal fusion transformer for passive activity recognition
    Koupai, Armand K.
    Bocus, Mohammud J.
    Santos-Rodriguez, Raul
    Piechocki, Robert J.
    McConville, Ryan
    IET WIRELESS SENSOR SYSTEMS, 2022, 12 (5-6) : 149 - 160
  • [8] STFNet: Self-Supervised Transformer for Infrared and Visible Image Fusion
    Liu, Qiao
    Pi, Jiatian
    Gao, Peng
    Yuan, Di
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1513 - 1526
  • [9] TFDEPTH: SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION WITH MULTI-SCALE SELECTIVE TRANSFORMER FEATURE FUSION
    Hu, Hongli
    Miao, Jun
    Zhu, Guanghu
    Yan, Je
    Chu, Jun
    IMAGE ANALYSIS & STEREOLOGY, 2024, 43 (02): : 139 - 149
  • [10] A Self-Supervised Decision Fusion Framework for Building Detection
    Senaras, Caglar
    Vural, Fatos T. Yarman
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2016, 9 (05) : 1780 - 1791