Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations

Cited by: 4
Authors
Liu, Weijia [1]
Cao, Jiuxin [1, 2]
Wei, Ran [3]
Zhu, Xuelin [1]
Liu, Bo [3]
Affiliations
[1] Southeast Univ, Sch Cyber Sci Engn, Nanjing 211189, Peoples R China
[2] Purple Mt Labs, Nanjing 211111, Peoples R China
[3] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Micro-video venue recognition; graph neural network; attention mechanism; multi-modal fusion; online social network; SCENE; FEATURES; CNN;
DOI
10.1109/TCSVT.2023.3349202
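Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract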
Micro-video venue recognition aims to predict the venue category where a micro-video was filmed. Unlike traditional long videos, which contain rich temporal context, micro-videos are difficult to classify by venue because of their limited duration (generally within 6 s). Existing works usually extract features of each modality from a global perspective for prediction, neglecting the semantics carried by local objects. To address this, we propose Multi-Modal and Multi-Granularity Object Relations (M²ORE), which learns multi-granularity interactive semantics between venues and multimodal semantic objects to help understand venues. Specifically, M²ORE comprises two modules. The first extracts semantic objects of different modalities, i.e., visual objects in keyframes and keywords in texts, and models both the affiliation relationship between semantic objects and venues and the co-occurrence relationship among semantic objects, forming a heterogeneous venue-object relation graph. Then, to capture the interactive semantics between venues and objects from the relation graph, a novel Parallel-Graph Inference Model (Parallel-GIM) is proposed, which updates the node representations through graph propagation and fuses multi-level features (local, global, and multimodal) through a hierarchical attention mechanism. Finally, the probability distribution over venues is obtained by feeding the comprehensive features of the venue nodes to a multi-layer perceptron. Extensive experiments on a real-world micro-video dataset demonstrate the superiority of the proposed M²ORE.
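The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy videos, the function names `build_relation_graph` and `propagate`, the two-dimensional node features, and the 0.5 mixing weight are all assumptions made for illustration. The propagation step is a plain weighted mean over neighbors, standing in for Parallel-GIM, whose details the abstract does not give.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each micro-video has a venue label and a set of
# semantic objects (visual objects from keyframes plus keywords from text).
videos = [
    ("beach", {"sand", "wave", "surfboard"}),
    ("beach", {"sand", "wave", "umbrella"}),
    ("gym", {"treadmill", "dumbbell"}),
    ("gym", {"dumbbell", "mirror", "treadmill"}),
]


def build_relation_graph(samples):
    """Count venue-object affiliation edges and object-object co-occurrence edges."""
    affiliation = Counter()
    cooccurrence = Counter()
    for venue, objects in samples:
        for obj in objects:
            affiliation[(venue, obj)] += 1
        # Sort so that each unordered object pair maps to a single key.
        for a, b in combinations(sorted(objects), 2):
            cooccurrence[(a, b)] += 1
    return affiliation, cooccurrence


def propagate(features, edges, mix=0.5):
    """One round of weighted mean aggregation over an undirected weighted graph.

    This is an assumed stand-in for graph propagation, not Parallel-GIM.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))
    updated = {}
    for node, vec in features.items():
        nbrs = neighbors.get(node, [])
        total = sum(w for _, w in nbrs)
        if total == 0:
            updated[node] = list(vec)  # isolated node keeps its features
            continue
        agg = [
            sum(w * features[n][i] for n, w in nbrs) / total
            for i in range(len(vec))
        ]
        updated[node] = [(1 - mix) * x + mix * a for x, a in zip(vec, agg)]
    return updated


affiliation, cooccurrence = build_relation_graph(videos)

# Venue nodes start with one-hot features; object nodes start at zero.
features = {"beach": [1.0, 0.0], "gym": [0.0, 1.0]}
for _, objects in videos:
    for obj in objects:
        features.setdefault(obj, [0.0, 0.0])

# Merge both edge types into one graph for this simplified single-graph step.
updated = propagate(features, affiliation + cooccurrence)
print(updated["sand"], updated["dumbbell"])
```

After one step, the object node `sand` leans toward the `beach` dimension and `dumbbell` toward `gym`, which is the kind of venue-object interaction the relation graph is meant to expose. A fuller model would keep the edge types separate and learn attention weights in place of the fixed mean.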
Pages: 5440-5451
Number of pages: 12