A survey of multimodal machine learning

Cited: 0
Authors
Chen P. [1 ,2 ]
Li Q. [1 ,2 ]
Zhang D.-Z. [3 ,4 ]
Yang Y.-H. [1 ]
Cai Z. [1 ]
Lu Z.-Y. [1 ]
Affiliations
[1] School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing
[2] Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing
[3] School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing
[4] Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing
Keywords
Adversarial learning; Deep learning; Feature representation; Multi-modal learning; Statistical learning
DOI
10.13374/j.issn2095-9389.2019.03.21.003
Abstract
"Big data" is typically collected from different sources with different data structures. With the rapid development of information technology, valuable data resources are increasingly multimodal. As a result, multi-modal learning, built on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand "big data". Human cognition involves perception through different sense organs: signals from the eyes, ears, nose, and hands (the tactile sense) together constitute a person's understanding of a specific scene or of the world as a whole. It is reasonable to believe that multi-modal methods, with their greater ability to process complex heterogeneous data, can further promote the progress of information technology. The concept of multimodality originated in psychology and pedagogy hundreds of years ago and has become popular in computer science over the past decade. In contrast to the concept of "media", a "mode" is a more fine-grained concept associated with a typical data source or data form. The effective use of multi-modal data can help a computer understand a specific environment in a more holistic way. In this context, we first introduce the definition and main tasks of multi-modal learning. Building on this, the mechanism and origin of multi-modal machine learning are briefly introduced. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks are comprehensively summarized. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching and generation are reviewed. This paper outlines the main methods of multi-modal learning, with a focus on future research issues in this field. © All rights reserved.
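The abstract names two of the fusion styles surveyed, feature-level representation (concatenation) and shared mapping into a common space. A minimal sketch of both is given below; all array shapes, variable names, and the random linear maps (which stand in for learned projections) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two modalities with different feature dimensions (e.g. image and text),
# four samples each. Dimensions are arbitrary choices for illustration.
image_feat = rng.standard_normal((4, 128))  # 4 samples, 128-dim image features
text_feat = rng.standard_normal((4, 64))    # 4 samples, 64-dim text features

# 1) Feature-level fusion: concatenate per-sample feature vectors.
fused = np.concatenate([image_feat, text_feat], axis=1)  # shape (4, 192)

# 2) Shared mapping: project each modality into a common 32-dim space.
#    Random matrices stand in for projections that would normally be learned.
W_img = rng.standard_normal((128, 32))
W_txt = rng.standard_normal((64, 32))
img_shared = image_feat @ W_img  # shape (4, 32)
txt_shared = text_feat @ W_txt   # shape (4, 32)

print(fused.shape, img_shared.shape, txt_shared.shape)
```

In the shared-mapping style, cross-modal matching then reduces to comparing `img_shared` and `txt_shared` rows in the common space, e.g. by cosine similarity.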
Pages: 557-569
Page count: 12
References
111 in total
  • [1] Pedwell R K, Hardy S L, et al., Effective visual design and communication practices for research posters: Exemplars based on the theory and practice of multimedia learning and rhetoric, Biochem Mol Biol Educ, 45, 3, (2017)
  • [2] Welch K E., Electric Rhetoric: Classical Rhetoric, Oralism, and A New Literacy, (1999)
  • [3] Berlin J A, Contemporary composition: the major pedagogical theories, College English, 44, 8, (1982)
  • [4] O'Halloran K L, Interdependence, interaction and metaphor in multi-semiotic texts, Social Semiotics, 9, 3, (1999)
  • [5] O'Halloran K L, Classroom discourse in mathematics: a multi-semiotic analysis, Linguistics Educ, 10, 3, (1998)
  • [6] Morency L P, Baltrusaitis T., Tutorial on multimodal machine learning [R/OL]
  • [7] Plummer B A, Wang L W, Cervantes C M, et al., Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models, Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), (2015)
  • [8] von Glasersfeld E, Pisani P P, The multistore parser for hierarchical syntactic structures, Commun ACM, 13, 2, (1970)
  • [9] Jackson P., Introduction to Expert Systems, (1998)
  • [10] Cortes C, Vapnik V, Support-vector networks, Machine Learning, 20, 3, (1995)