Survey on Video Moment Retrieval

Authors
Wang Y. [1]
Zhan Y.-W. [1]
Luo X. [1]
Liu M. [2]
Xu X.-S. [1]
Affiliations
[1] School of Software, Shandong University, Jinan
[2] School of Computer Science and Technology, Shandong Jianzhu University, Jinan
Source
Ruan Jian Xue Bao/Journal of Software | 2023 / Vol. 34 / No. 02
Keywords
artificial intelligence; deep learning; temporal activity localization via language; video moment retrieval; video understanding
DOI
10.13328/j.cnki.jos.006707
Abstract
Given a natural language sentence as the query, video moment retrieval aims to localize the most relevant moment in a long untrimmed video. Videos carry rich visual, textual, and audio information, so the crucial challenges of the task are how to fully understand the visual content of the video, how to exploit the textual information in the query sentence to improve the generalization and robustness of the model, and how to align cross-modal information and let it interact. This survey systematically reviews work in the field of video moment retrieval and divides it into ranking-based methods and localization-based methods. The ranking-based methods are further divided into methods that preset candidate clips and methods that generate candidate clips with guidance; the localization-based methods are divided into one-time localization methods and iterative localization methods. The datasets and evaluation metrics of this field are also summarized, and the latest advances are reviewed. Finally, related extension tasks, such as moment localization from a video corpus, are introduced, and the survey concludes with a discussion of promising trends. © 2023 Chinese Academy of Sciences. All rights reserved.
Pages: 985-1006
Number of pages: 21