Multimodal Machine Learning: A Survey and Taxonomy

Cited by: 2031
Authors
Baltrusaitis, Tadas [1]
Ahuja, Chaitanya [2]
Morency, Louis-Philippe [2]
Affiliations
[1] Microsoft Corp, Cambridge CB1 2FB, England
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (USA)
Keywords
Multimodal; machine learning; introductory; survey; EMOTION RECOGNITION; NEURAL-NETWORKS; SPEECH; TEXT; FUSION; VIDEO; LANGUAGE; MODELS; GENERATION; ALIGNMENT;
DOI
10.1109/TPAMI.2018.2798607
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
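The abstract contrasts its taxonomy with the typical early/late fusion categorization. As a minimal illustration of that distinction (not code from the paper; the feature dimensions and linear scorers below are hypothetical placeholders), early fusion combines modality features before a single model, while late fusion combines per-modality decisions:

```python
import numpy as np

# Hypothetical feature vectors for two modalities (e.g., audio and video).
rng = np.random.default_rng(0)
audio = rng.normal(size=8)
video = rng.normal(size=8)

# Early fusion: concatenate raw modality features, then apply one model.
early_input = np.concatenate([audio, video])   # shape (16,)
w_early = rng.normal(size=16)                  # placeholder linear scorer
early_score = float(early_input @ w_early)

# Late fusion: score each modality independently, then average the decisions.
w_audio = rng.normal(size=8)
w_video = rng.normal(size=8)
late_score = 0.5 * float(audio @ w_audio) + 0.5 * float(video @ w_video)

print(early_input.shape, early_score, late_score)
```

The survey's point is that this two-way split is too coarse: fusion is only one of five challenges (representation, translation, alignment, fusion, co-learning) it identifies.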
Pages: 423-443 (21 pages)
Related Papers
50 records in total
  • [41] Xu, Keyang; Lam, Mike; Pang, Jingzhi; Gao, Xin; Band, Charlotte; Mathur, Piyush; Papay, Frank; Khanna, Ashish K.; Cywinski, Jacek B.; Maheshwari, Kamal; Xie, Pengtao; Xing, Eric P. Multimodal Machine Learning for Automated ICD Coding. Proceedings of Machine Learning Research, 2019, 106: 197-215.
  • [42] Titung, Rajesh. Interactive Machine Learning for Multimodal Affective Computing. 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 2022.
  • [43] Krones, Felix; Marikkar, Umar; Parsons, Guy; Szmul, Adam; Mahdi, Adam. Review of multimodal machine learning approaches in healthcare. Information Fusion, 2025, 114.
  • [44] Capogrosso, Luigi; Cunico, Federico; Cheng, Dong Seon; Fummi, Franco; Cristani, Marco. A Machine Learning-Oriented Survey on Tiny Machine Learning. IEEE Access, 2024, 12: 23406-23426.
  • [45] Moeller, Sebastian; Engelbrecht, Klaus-Peter; Kuehnel, Christine; Wechsung, Ina; Weiss, Benjamin. A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction. QoMEX: 2009 International Workshop on Quality of Multimedia Experience, 2009: 7-12.
  • [46] Gao, Jing; Li, Peng; Chen, Zhikui; Zhang, Jianing. A Survey on Deep Learning for Multimodal Data Fusion. Neural Computation, 2020, 32(05): 829-864.
  • [47] Pan, Mengzhu; Li, Qianmu; Qiu, Tian. Survey of Research on Deep Multimodal Representation Learning. Computer Engineering and Applications, 2024, 59(02): 48-64.
  • [48] Lin, Yi-Ming; Gao, Yuan; Gong, Mao-Guo; Zhang, Si-Jia; Zhang, Yuan-Qiao; Li, Zhi-Yuan. Federated Learning on Multimodal Data: A Comprehensive Survey. Machine Intelligence Research, 2023, 20: 539-553.
  • [49] Yuan, Yuan; Li, Zhaojian; Zhao, Bin. A Survey of Multimodal Learning: Methods, Applications, and Future. ACM Computing Surveys, 2025, 57(07).
  • [50] Kour, Herleen; Gondhi, Naveen. Machine Learning Techniques: A Survey. Innovative Data Communication Technologies and Application, 2020, 46: 266-275.