InteractNet: Social Interaction Recognition for Semantic-rich Videos

Authors
Lyu, Yuanjie [1]
Qin, Penggang [1]
Xu, Tong [1]
Zhu, Chen [1, 2]
Chen, Enhong [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] BOSS Zhipin, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal analysis; video-and-language understanding; graph convolutional network
DOI
10.1145/3663668
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The rapid growth of online video platforms has created an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos can reflect more complicated semantics, such as character relationships or emotions, and therefore better support downstream applications such as story summarization and fine-grained clip retrieval. However, social interactions last longer, overlap heavily with one another, and involve multiple characters, dynamic scenes, and multi-modal cues, so traditional solutions for short-term action recognition are likely to fail on this task. To address these challenges, in this article we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns the node representations as short-term interaction patterns via an adapted GCN module. Next, global interaction representations are accumulated through a sub-clip identification module, which filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
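As a rough illustration of the per-frame step described in the abstract, the sketch below applies one standard graph convolution (the widely used symmetric-normalized form) to a small frame graph. It is not the paper's adapted GCN module, whose details are not given here; the function name `gcn_layer`, the three-node graph, and the feature sizes are all assumptions made for illustration.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    Generic textbook layer, assumed here for illustration only.
    """
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1)                  # degrees are >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by deg^-1/2
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)  # ReLU


# Hypothetical frame graph with 3 nodes and edges 0-1 and 1-2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))    # assumed 4-dim multi-modal node features
weight = rng.normal(size=(4, 2))   # assumed projection to 2-dim node representations
node_repr = gcn_layer(adj, feats, weight)
print(node_repr.shape)
```

Each output row mixes a node's own features with those of its neighbours, which is the sense in which node representations can encode short-term interaction patterns within a frame.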
Pages: 21