Multimodal Machine Learning: A Survey and Taxonomy

Cited by: 2031
Authors
Baltrusaitis, Tadas [1]
Ahuja, Chaitanya [2]
Morency, Louis-Philippe [2]
Affiliations
[1] Microsoft Corp, Cambridge CB1 2FB, England
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (United States);
Keywords
Multimodal; machine learning; introductory; survey; EMOTION RECOGNITION; NEURAL-NETWORKS; SPEECH; TEXT; FUSION; VIDEO; LANGUAGE; MODELS; GENERATION; ALIGNMENT;
DOI
10.1109/TPAMI.2018.2798607
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Pages: 423-443
Page count: 21
Related papers
50 items in total
  • [11] Machine learning approaches for active queue management: A survey, taxonomy, and future directions
    Toopchinezhad, Mohammad Parsa
    Ahmadi, Mahmood
    COMPUTER NETWORKS, 2025, 262
  • [12] DDoS attacks and machine-learning-based detection methods: A survey and taxonomy
    Najafimehr, Mohammad
    Zarifzadeh, Sajjad
    Mostafavi, Seyedakbar
    ENGINEERING REPORTS, 2023, 5 (12)
  • [13] A survey on multimodal bidirectional machine learning translation of image and natural language processing
    Nam, Wongyung
    Jang, Beakcheol
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 235
  • [14] Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects
    Warner, Elisa
    Lee, Joonsang
    Hsu, William
    Syeda-Mahmood, Tanveer
    Kahn Jr, Charles E.
    Gevaert, Olivier
    Rao, Arvind
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 3753 - 3769
  • [15] Overview of Multimodal Machine Learning
    Al-Zoghby, Aya M.
    Al-Awadly, Esraa Mohamed K.
    Ebada, Ahmed Ismail
    Awad, Wael A.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2025, 24 (01)
  • [16] Orchestrating the Development Lifecycle of Machine Learning-based IoT Applications: A Taxonomy and Survey
    Qian, Bin
    Su, Jie
    Wen, Zhenyu
    Jha, Devki Nandan
    Li, Yinhao
    Guan, Yu
    Puthal, Deepak
    James, Philip
    Yang, Renyu
    Zomaya, Albert Y.
    Rana, Omer
    Wang, Lizhe
    Koutny, Maciej
    Ranjan, Rajiv
    ACM COMPUTING SURVEYS, 2020, 53 (04)
  • [17] A Taxonomy of Machine-Learning-Based Intrusion Detection Systems for the Internet of Things: A Survey
    Jamalipour, Abbas
    Murali, Sarumathi
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (12) : 9444 - 9466
  • [18] Multimodal Federated Learning: A Survey
    Che, Liwei
    Wang, Jiaqi
    Zhou, Yao
    Ma, Fenglong
    SENSORS, 2023, 23 (15)
  • [19] Multimodal Learning With Transformers: A Survey
    Xu, Peng
    Zhu, Xiatian
    Clifton, David A.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 12113 - 12132
  • [20] Multimodal Machine Learning for Pedestrian Detection
    Aledhari, Mohammed
    Razzak, Rehma
    Parizi, Reza M.
    Srivastava, Gautam
    2021 IEEE 93RD VEHICULAR TECHNOLOGY CONFERENCE (VTC2021-SPRING), 2021