Deep Multimodal Data Fusion

Cited by: 22
Authors
Zhao, Fei [1]
Zhang, Chengcui [2]
Geng, Baocheng [3]
Affiliations
[1] Univ Alabama Birmingham, Univ Hall 4105,1402 10th Ave S, Birmingham, AL 35294 USA
[2] Univ Alabama Birmingham, Univ Hall 4143,1402 10th Ave S, Birmingham, AL 35294 USA
[3] Univ Alabama Birmingham, Univ Hall 4147,1402 10th Ave S, Birmingham, AL 35294 USA
Keywords
Data fusion; neural networks; multimodal deep learning; PERSON REIDENTIFICATION; ATTENTION NETWORK; NEURAL-NETWORKS; URBAN DATASET; IMAGE;
DOI
10.1145/3649447
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Multimodal Artificial Intelligence (Multimodal AI) generally involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making into a single model, and the boundaries between these processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), which is based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy that groups the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. In contrast, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
Pages: 36
Related papers
50 in total
  • [21] TRAINING OF DEEP BIDIRECTIONAL RNNS FOR HAND MOTION FILTERING VIA MULTIMODAL DATA FUSION
    Shahtalebi, Soroosh
    Atashzar, S. Farokh
    Patel, Rajni, V
    Mohammadi, Arash
    2019 7TH IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (IEEE GLOBALSIP), 2019,
  • [22] Multimodal Data Fusion for Precise Lettuce Phenotype Estimation Using Deep Learning Algorithms
    Hou, Lixin
    Zhu, Yuxia
    Wang, Mengke
    Wei, Ning
    Dong, Jiachi
    Tao, Yaodong
    Zhou, Jing
    Zhang, Jian
    PLANTS-BASEL, 2024, 13 (22):
  • [23] Research progress on electronic health records multimodal data fusion based on deep learning
    Fan, Yong
    Zhang, Zhengbo
    Wang, Jing
    Shengwu Yixue Gongchengxue Zazhi/Journal of Biomedical Engineering, 2024, 41 (05): 1062 - 1071
  • [24] Multimodal deep fusion for image question answering
    Zhang, Weifeng
    Yu, Jing
    Wang, Yuxia
    Wang, Wei
    KNOWLEDGE-BASED SYSTEMS, 2021, 212
  • [25] Deep Multimodal Fusion for Surgical Feedback Classification
    Kocielnik, Rafal
    Wong, Elyssa Y.
    Chu, Timothy N.
    Lin, Lydia
    Huang, De-An
    Wang, Jiayun
    Anandkumar, Anima
    Hung, Andrew J.
    MACHINE LEARNING FOR HEALTH, ML4H, VOL 225, 2023, 225 : 256 - 267
  • [26] Advances in deep learning for multimodal fusion and alignment
    Multimedia Tools and Applications, 2022, 81 : 11931 - 11931
  • [27] Advances in deep learning for multimodal fusion and alignment
    Huang, Feiran
    Mumtaz, Shahid
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (09) : 11931 - 11931
  • [28] Interpretation on Deep Multimodal Fusion for Diagnostic Classification
    Xin, Bowen
    Huang, Jing
    Zhou, Yun
    Lu, Jie
    Wang, Xiuying
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [29] Deep Learning for HABs Prediction with Multimodal Fusion
    Zhao, Fei
    Zhang, Chengcui
    31ST ACM SIGSPATIAL INTERNATIONAL CONFERENCE ON ADVANCES IN GEOGRAPHIC INFORMATION SYSTEMS, ACM SIGSPATIAL GIS 2023, 2023: 17 - 18
  • [30] User Profiling through Deep Multimodal Fusion
    Farnadi, Golnoosh
    Tang, Jie
    De Cock, Martine
    Moens, Marie-Francine
    WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2018: 171 - 179