Action Recognition Based on Feature Interaction and Clustering

Cited by: 0
Authors
Li K. [1]
Cai P. [1]
Zhou Z. [1]
Affiliations
[1] State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing
Keywords
action recognition; feature clustering; feature interaction; spatiotemporal feature relationship
DOI
10.3724/SP.J.1089.2023.19493
Abstract
To mitigate the problem that existing action recognition methods lack modeling of spatiotemporal feature relationships, an action recognition method based on feature interaction and clustering is proposed. First, a mixed multi-scale feature extraction network is designed to extract the spatial and temporal features of consecutive frames. Second, a feature interaction module based on the non-local operation is designed to realize spatiotemporal feature interaction. Finally, a hard sample selection strategy based on the triplet loss function is designed to train the recognition network, realizing spatiotemporal feature clustering and improving the robustness and discriminability of the features. Experimental results show that, compared with TSN, the accuracy on the UCF101 dataset is increased by 23.25 percentage points to 94.82%, and the accuracy on the HMDB51 dataset is increased by 20.27 percentage points to 44.03%. © 2023 Institute of Computing Technology. All rights reserved.
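Two of the components the abstract names, the non-local feature interaction and the triplet loss with hard sample selection, can be sketched in PyTorch. The sketch below is a minimal illustration under assumed shapes and hyperparameters (a channel reduction to C/2 and a margin of 0.3 are common choices, not values from the paper), and is not the authors' actual implementation.

```python
# Minimal sketch (assumptions, not the paper's code): a non-local-style
# interaction block over spatiotemporal features, plus a batch-hard
# triplet loss for feature clustering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalInteraction(nn.Module):
    """Non-local block: every spatiotemporal position attends to every other."""

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2  # reduced embedding dimension (assumed)
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T, H, W)
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, THW, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, THW, inter)
        attn = F.softmax(q @ k, dim=-1)                # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                         # residual connection


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For each anchor, use the hardest positive (farthest same-class sample)
    and hardest negative (closest other-class sample) in the batch."""
    dist = torch.cdist(features, features)             # (N, N) pairwise L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-class mask
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```

In a typical pipeline of this kind, clip features of shape (batch, C, T, H, W) pass through the interaction block, are pooled into per-clip embeddings, and the triplet term is combined with a classification loss; the hard-sample mining is what pulls same-action features into tight clusters while pushing different actions apart.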
Pages: 903-914
Page count: 11
References (31 in total)
  • [21] Chang X B, Hospedales T M, Xiang T., Multi-level factorisation net for person re-identification, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2109-2118, (2018)
  • [22] Howard A G, Zhu M, Chen B, et al., MobileNets: efficient convolutional neural networks for mobile vision applications
  • [23] Buades A, Coll B, Morel J M., A non-local algorithm for image denoising, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 60-65, (2005)
  • [24] Vaswani A, Shazeer N, Parmar N, et al., Attention is all you need, Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000-6010, (2017)
  • [25] Soomro K, Zamir A R, Shah M., UCF101: a dataset of 101 human actions classes from videos in the wild
  • [26] Kuehne H, Jhuang H, Garrote E, et al., HMDB: a large video database for human motion recognition, Proceedings of the IEEE International Conference on Computer Vision, pp. 2556-2563, (2011)
  • [27] Sandler M, Howard A, Zhu M, et al., MobileNetV2: inverted residuals and linear bottlenecks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, (2018)
  • [28] Varol G, Laptev I, Schmid C., Long-term temporal convolutions for action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 6, pp. 1510-1517, (2018)
  • [29] Duta I C, Ionescu B, Aizawa K, et al., Spatio-temporal vector of locally max pooled features for action recognition in videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3205-3214, (2017)
  • [30] Butt A M, Yousaf M H, Murtaza F, et al., Agglomerative clustering and residual-VLAD encoding for human action recognition, Applied Sciences, 10, 12, (2020)