Deep Multimodal Fusion for Surgical Feedback Classification

Citations: 0
Authors
Kocielnik, Rafal [1]
Wong, Elyssa Y. [2]
Chu, Timothy N. [2]
Lin, Lydia [1,2]
Huang, De-An [3]
Wang, Jiayun [1]
Anandkumar, Anima [1]
Hung, Andrew J. [4]
Affiliations
[1] California Institute of Technology (Caltech), Pasadena, CA 91125, USA
[2] University of Southern California, Los Angeles, CA, USA
[3] NVIDIA, Santa Clara, CA, USA
[4] Cedars-Sinai Medical Center, Los Angeles, CA, USA
Funding
National Institutes of Health (USA)
Keywords
Surgical feedback; Multimodality; Robot-assisted surgery; Deep learning
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Quantification of the real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., visual cues such as pointing to anatomic structures). In this work, we leverage a clinically validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise", and "Visual Aid". We then develop a multi-label machine learning model that classifies these five categories of surgical feedback from text, audio, and video inputs. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6, with fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of the feedback audio by experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvement. Empirically, we find that a staged training strategy, which first pre-trains each modality separately and then trains all modalities jointly, is more effective than training all modalities together from the start. We also present intuitive findings on the importance of the different modalities for each feedback category. This work offers an important first look at the feasibility of automated classification of real-world, live surgical feedback based on text, audio, and video modalities.
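
To make the fusion and staged-training ideas in the abstract concrete, below is a minimal sketch in PyTorch: three modality encoders (text, audio, video) whose embeddings are concatenated and mapped to the five multi-label feedback categories, trained in two stages. All architectures, feature dimensions, hyperparameters, and names here are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

# The five clinically validated feedback categories from the paper.
CATEGORIES = ["Anatomic", "Technical", "Procedural", "Praise", "Visual Aid"]

class ModalityEncoder(nn.Module):
    # Stand-in encoder: maps a per-modality feature vector to a shared
    # embedding; the per-modality head is used only for stage-1 pre-training.
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, len(CATEGORIES))

    def forward(self, x):
        return self.net(x)

class FusionClassifier(nn.Module):
    # Late fusion: concatenate the three modality embeddings, then apply
    # a joint multi-label head (one logit per feedback category).
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512, hidden=256):
        super().__init__()
        self.text = ModalityEncoder(text_dim, hidden)
        self.audio = ModalityEncoder(audio_dim, hidden)
        self.video = ModalityEncoder(video_dim, hidden)
        self.joint_head = nn.Linear(3 * hidden, len(CATEGORIES))

    def forward(self, text_x, audio_x, video_x):
        fused = torch.cat(
            [self.text(text_x), self.audio(audio_x), self.video(video_x)],
            dim=-1)
        return self.joint_head(fused)  # multi-label logits

model = FusionClassifier()
loss_fn = nn.BCEWithLogitsLoss()  # multi-label: independent sigmoid per category

# Dummy batch standing in for real text/audio/video features and labels.
text_x, audio_x, video_x = torch.randn(8, 768), torch.randn(8, 128), torch.randn(8, 512)
labels = torch.randint(0, 2, (8, len(CATEGORIES))).float()

# Stage 1: pre-train each modality separately (text shown; audio/video analogous).
opt1 = torch.optim.Adam(model.text.parameters(), lr=1e-4)
loss = loss_fn(model.text.head(model.text(text_x)), labels)
loss.backward(); opt1.step(); opt1.zero_grad()

# Stage 2: fine-tune all modalities jointly through the fusion head.
opt2 = torch.optim.Adam(model.parameters(), lr=1e-5)
loss = loss_fn(model(text_x, audio_x, video_x), labels)
loss.backward(); opt2.step(); opt2.zero_grad()

The two optimizers mirror the empirical finding reported in the abstract: per-modality pre-training followed by joint fine-tuning through the fusion head, rather than training all modalities together from the start.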
Pages: 256-267 (12 pages)