31 entries in total
- [1] Linchao Zhu, Yi Yang, ActBERT: Learning global-local video-text representations[C], Proc of the IEEE Conf on Computer Vision and Pattern Recognition, pp. 8746-8755, (2020)
- [2] Huaishao Luo, Lei Ji, Botian Shi, et al., UniVL: A unified video and language pre-training model for multimodal understanding and generation[J], (2020)
- [3] Linjie Li, Yen-Chun Chen, Cheng Yu, et al., HERO: Hierarchical encoder for video+language omni-representation pre-training[C], Proc of the Conf on Empirical Methods in Natural Language Processing, pp. 2046-2065, (2020)
- [4] Gabeur V, Sun Chen, Alahari K, et al., Multi-modal transformer for video retrieval[C], Proc of the European Conf on Computer Vision, pp. 214-229, (2020)
- [5] Patrick M, Huang Poyao, Asano Y, et al., Support-set bottlenecks for video-text representation learning
- [6] Rouditchenko A, Boggust A, Harwath D, et al., AVLnet: Learning audio-visual language representations from instructional videos[J], (2020)
- [7] Famin Wu, Guangyi Lu, Qi Liu, Et al., Deep semantic representation of time-sync comments for videos[J], Journal of Computer Research and Development, 56, 2, (2019)
- [8] Fan Yang, Bin Xiao, Zhiwen Yu, Anomaly detection and modeling of surveillance video[J], Journal of Computer Research and Development, 58, 12, pp. 2708-2723, (2021)
- [9] Haitao Yu, Xiaoshan Yang, Changsheng Xu, Antagonistic video generation method based on multimodal input[J], Journal of Computer Research and Development, 57, 7, (2020)
- [10] Yujie Dian, Qin Jin, Audio-visual correlated multimodal concept detection[J], Journal of Computer Research and Development, 56, 5, (2019)