Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Cited by: 0
Authors
Rong, Huan [1 ]
Chen, Zhongfeng [1 ]
Lu, Zhenyu [1 ]
Xu, Fan [2 ]
Sheng, Victor S. [3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Artificial Intelligence, 219 Ningliu Rd, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Normal Univ, Sch Comp Informat Engn, Nanchang, Jiangxi, Peoples R China
[3] Texas Tech Univ, Dept Comp Sci, Lubbock, TX 79430 USA
Funding
National Natural Science Foundation of China;
Keywords
Business intelligence; multi-modal summarization; semantic enhancement and attention; multi-modal cross learning;
DOI
10.1145/3651983
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article focuses on the task of Multi-Modal Summarization with Multi-Modal Output for JD.COM (China) e-commerce product descriptions containing both source text and source images. When learning the context of multi-modal (text and image) input, a semantic gap exists between text and image, especially in their cross-modal semantics. Capturing shared cross-modal semantics early therefore becomes crucial for multi-modal summarization. Moreover, when generating the multi-modal summary, the relevance and irrelevance of the multi-modal contexts to the target summary should be weighed according to the different contributions of the input text and images, so as to optimize the cross-modal context learning that guides summary generation and to emphasize the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism captures shared semantics between the two modalities (text and image), so as to reinforce crucial multi-modal information in the encoding stage. In addition, an IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both text and image semantic information. Experimental results on the JD.COM e-commerce dataset demonstrate that the proposed Multization method effectively captures the shared semantics between the input source text and source images and highlights essential semantics. It also successfully generates a multi-modal summary (including image and text) that comprehensively considers the semantic information of both text and image.
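The cross-modal alignment step the abstract describes (letting text representations attend over image-region features to capture shared semantics) can be illustrated with a minimal scaled dot-product attention sketch. This is an illustrative approximation only, not the paper's actual Semantic Alignment Enhancement mechanism; all function and variable names are hypothetical.

```python
# Illustrative sketch: text tokens attend over image-region features,
# producing an image-conditioned context vector per token. This is a
# generic cross-modal attention step, not the paper's implementation.
import numpy as np

def cross_modal_attention(text_feats, image_feats):
    """text_feats: (T, d) token features; image_feats: (R, d) region
    features. Returns (T, d) image context aligned to each token."""
    d = text_feats.shape[-1]
    # Scaled dot-product similarity between every token and every region.
    scores = text_feats @ image_feats.T / np.sqrt(d)       # (T, R)
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over regions
    return weights @ image_feats                           # (T, d)

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))     # 5 text tokens, dimension 8
image = rng.standard_normal((3, 8))    # 3 image regions, dimension 8
aligned = cross_modal_attention(text, image)
print(aligned.shape)  # (5, 8)
```

In a full model, the aligned context would typically be fused back into the token representations (e.g. by addition or gating) before decoding, which is where relevant/irrelevant weighting such as the paper's IR-Relevant Multi-Context Learning would intervene.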
Pages: 29
Related papers (50 total)
  • [31] Extractive summarization of documents with images based on multi-modal RNN
    Chen, Jingqiang
    Zhuge, Hai
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 99 : 186 - 196
  • [32] MAFE: Multi-modal Alignment via Mutual Information Maximum Perspective in Multi-modal Fake News Detection
    Qin, Haimei
    Jing, Yaqi
    Duan, Yunqiang
    Jiang, Lei
    PROCEEDINGS OF THE 2024 27TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, CSCWD 2024, 2024, : 1515 - 1521
  • [33] Multi-Modal Supplementary-Complementary Summarization using Multi-Objective Optimization
    Jangra, Anubhav
    Saha, Sriparna
    Jatowt, Adam
    Hasanuzzaman, Mohammed
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 818 - 828
  • [34] MMEA: Entity Alignment for Multi-modal Knowledge Graph
    Chen, Liyi
    Li, Zhi
    Wang, Yijun
    Xu, Tong
    Wang, Zhefeng
    Chen, Enhong
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2020), PT I, 2020, 12274 : 134 - 147
  • [35] Gromov-Wasserstein Multi-modal Alignment and Clustering
    Gong, Fengjiao
    Nie, Yuzhou
    Xu, Hongteng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 603 - 613
  • [36] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [37] Adaptive Feature Fusion for Multi-modal Entity Alignment
    Guo H.
    Li X.-Y.
    Tang J.-Y.
    Guo Y.-M.
    Zhao X.
    Zidonghua Xuebao/Acta Automatica Sinica, 2024, 50 (04): : 758 - 770
  • [38] Semantic Alignment Network for Multi-Modal Emotion Recognition
    Hou, Mixiao
    Zhang, Zheng
    Liu, Chang
    Lu, Guangming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5318 - 5329
  • [39] Progressively Modality Freezing for Multi-Modal Entity Alignment
    Huang, Yani
    Zhang, Xuefeng
    Zhang, Richong
    Chen, Junfan
    Kim, Jaein
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 3477 - 3489
  • [40] Focusing on Relevant Responses for Multi-Modal Rumor Detection
    Li, Jun
    Bin, Yi
    Peng, Liang
    Yang, Yang
    Li, Yangyang
    Jin, Hao
    Huang, Zi
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 6225 - 6236