SVT-SDE: Spatiotemporal Vision Transformers-Based Self-Supervised Depth Estimation in Stereoscopic Surgical Videos

Cited by: 6

Authors:

Tao, Rong [1]

Huang, Baoru [2]

Zou, Xiaoyang [1]

Zheng, Guoyan [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Sch Biomed Engn, Inst Med Robot, Shanghai 200240, Peoples R China

[2] Imperial Coll London, Hamlyn Ctr Robot Surg, Dept Surg & Canc, London SW7 2AZ, England

Source:

IEEE TRANSACTIONS ON MEDICAL ROBOTICS AND BIONICS | 2023 / Vol. 5 / No. 1

Funding:

National Natural Science Foundation of China;

Keywords:

Estimation; Image reconstruction; Videos; Surgery; Spatiotemporal phenomena; Feature extraction; Cameras; Depth estimation; surgical videos; spatiotemporal vision transformers; unsupervised; DEFORMATION RECOVERY; RECONSTRUCTION; NETWORKS; SURGERY;

DOI:

10.1109/TMRB.2023.3237867

Chinese Library Classification (CLC) number:

R318 [Biomedical Engineering];

Discipline classification code:

0831;

Abstract:

Dense depth estimation plays a crucial role in developing context-aware computer-assisted intervention systems. However, it is a challenging task due to low image quality and the highly dynamic surgical environment. The task is further complicated by the difficulty of acquiring per-pixel ground truth depth data in a surgical setting. Recent works on self-supervised depth estimation use image reconstruction (i.e., the warped images) as the supervisory signal, which eliminates the requirement for ground truth depth annotations but also causes over-smoothed depth predictions. Additionally, most existing depth estimation methods are built upon static laparoscopic images, ignoring rich temporal information. To address these challenges, we propose a novel spatiotemporal vision transformers-based self-supervised depth estimation method, referred to as SVT-SDE. Unlike previous works, SVT-SDE features a novel spatiotemporal vision transformers (SVT) architecture, which can learn complementary visual and temporal information from the input stereoscopic video clips. We further introduce a high-frequency-based supervisory signal, which helps to preserve fine-grained details in the depth estimates. Results from experiments conducted on two publicly available datasets demonstrate the superior performance of SVT-SDE over state-of-the-art self-supervised depth estimation methods.
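
To illustrate the image-reconstruction supervision mentioned in the abstract, the following is a minimal NumPy sketch and not the authors' implementation. It assumes rectified grayscale stereo pairs and a left-view disparity map. The Laplacian filter is only a stand-in for a high-frequency signal, since the abstract does not say how that signal is computed. All function names and the `hf_weight` parameter are hypothetical.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x).

    Uses linear interpolation along the horizontal axis and clamps
    sample positions to the image border.
    """
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity  # source columns
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


def laplacian(img):
    """4-neighbour Laplacian with edge padding (a simple high-pass filter)."""
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img


def reconstruction_loss(left, right, disparity, hf_weight=0.5):
    """L1 photometric error plus an assumed high-frequency (Laplacian) term."""
    recon = warp_right_to_left(right, disparity)
    photometric = np.abs(left - recon).mean()
    high_freq = np.abs(laplacian(left) - laplacian(recon)).mean()
    return photometric + hf_weight * high_freq


# Toy usage: the left view is the right view shifted by 3 pixels, so the
# true disparity should give a lower loss than a zero-disparity guess.
rng = np.random.default_rng(0)
right_img = rng.random((16, 32))
left_img = np.roll(right_img, 3, axis=1)
loss_true = reconstruction_loss(left_img, right_img, np.full((16, 32), 3.0))
loss_zero = reconstruction_loss(left_img, right_img, np.zeros((16, 32)))
```

No ground truth depth appears anywhere in the loss: the disparity is scored only by how well it lets one view be rebuilt from the other. That is the sense in which such methods are self-supervised.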


Pages: 42-53

Number of pages: 12
