Deep Multimodal Fusion for Surgical Feedback Classification

Cited by: 0
Authors
Kocielnik, Rafal [1]
Wong, Elyssa Y. [2]
Chu, Timothy N. [2]
Lin, Lydia [1,2]
Huang, De-An [3]
Wang, Jiayun [1]
Anandkumar, Anima [1]
Hung, Andrew J. [4]
Affiliations
[1] CALTECH, Pasadena, CA 91125 USA
[2] Univ Southern Calif, Los Angeles, CA USA
[3] NVIDIA, Santa Clara, CA USA
[4] Cedars Sinai Med Ctr, Los Angeles, CA USA
Funding
U.S. National Institutes of Health;
Keywords
Surgical feedback; Multimodality; Robot-Assisted Surgery; Deep Learning;
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Quantifying the real-time informal feedback that an experienced surgeon delivers to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., visual cues such as pointing to anatomic structures). In this work, we leverage a clinically validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise", and "Visual Aid". We then develop a multi-label machine learning model that classifies these five categories of surgical feedback from text, audio, and video inputs. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification achieves AUCs ranging from 71.5 to 77.6, with fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of the feedback audio by experts raise AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvement. Empirically, we find that a staged training strategy, which first pre-trains each modality separately and then trains the modalities jointly, is more effective than training all modalities together from the start. We also present intuitive findings on the importance of each modality for the different feedback categories. This work offers a first look at the feasibility of automatically classifying real-world live surgical feedback from text, audio, and video.
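The abstract describes a late-fusion, multi-label classifier over text, audio, and video, trained in two stages: per-modality pre-training followed by joint training. A minimal PyTorch sketch of that setup is shown below; all module names, feature dimensions, and training details are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of multi-label multimodal fusion with staged training.
# Everything here (module names, feature dimensions, the unimodal heads,
# and the two-stage loop) is an illustrative assumption.
import torch
import torch.nn as nn

NUM_LABELS = 5  # Anatomic, Technical, Procedural, Praise, Visual Aid

class ModalityEncoder(nn.Module):
    """Stand-in encoder: maps pre-extracted features to a shared embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        # Per-modality head used only during Stage 1 pre-training.
        self.head = nn.Linear(emb_dim, NUM_LABELS)

    def forward(self, x):
        return self.net(x)

class LateFusionClassifier(nn.Module):
    """Concatenates per-modality embeddings and classifies jointly."""
    def __init__(self, encoders: dict, emb_dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        self.fusion_head = nn.Linear(emb_dim * len(encoders), NUM_LABELS)

    def forward(self, inputs: dict):
        embs = [self.encoders[m](inputs[m]) for m in self.encoders]
        return self.fusion_head(torch.cat(embs, dim=-1))  # multi-label logits

# Assumed dims for pre-extracted text / audio / video features.
encoders = {"text": ModalityEncoder(768), "audio": ModalityEncoder(128),
            "video": ModalityEncoder(512)}
model = LateFusionClassifier(encoders)
criterion = nn.BCEWithLogitsLoss()  # independent sigmoid per label

def stage1_pretrain(batches):
    """Stage 1: train each modality encoder with its own unimodal head."""
    for name, enc in encoders.items():
        opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
        for inputs, labels in batches:
            loss = criterion(enc.head(enc(inputs[name])), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def stage2_joint(batches):
    """Stage 2: fine-tune all encoders jointly through the fusion head."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for inputs, labels in batches:
        loss = criterion(model(inputs), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The two-stage loop mirrors the abstract's empirical finding: giving each encoder a strong unimodal initialization before joint fusion training tends to outperform training all modalities together from scratch.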
Pages: 256-267
Page count: 12