With their ability to extract spatiotemporal features, 3D convolutional networks have become the mainstream approach to Video Saliency Prediction (VSP). However, these methods cannot make full use of hierarchical spatio-temporal features and lack focus on past salient features, which hinders further improvements in accuracy. To address these issues, we propose a 3D convolutional Network based on Hierarchical Fusion and Temporal Recurrence (HFTR-Net) for VSP. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), which adds a flow of shallow location information to the existing flow of deep semantic information. Then, unlike simple addition or concatenation, we design a Hierarchical Adaptive Fusion (HAF) mechanism that adaptively learns the fusion weights of adjacent features in order to integrate them appropriately. Moreover, to exploit previous salient information, a Recall 3D convGRU (R3D-GRU) module is integrated into a 3D convolution-based method for the first time. It subtly combines the local feature extraction of the 3D backbone with the long-term relationship modeling of the temporal recurrence mechanism. Experimental results on three common datasets demonstrate that HFTR-Net outperforms existing state-of-the-art methods in accuracy.
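The abstract does not spell out how HAF computes its fusion weights. The PyTorch sketch below illustrates one plausible reading of "adaptively learn the fusion weights of adjacent features": a 1x1 convolution predicts per-location logits for the two branches, and a softmax turns them into weights that sum to one. All names, shapes, and the softmax-weighted formulation here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Hypothetical fusion of two adjacent pyramid features with
    learned, per-location weights (a sketch of the HAF idea)."""
    def __init__(self, channels: int):
        super().__init__()
        # Predict one logit per input branch from the concatenated features.
        self.weight_pred = nn.Conv3d(2 * channels, 2, kernel_size=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (coarser) feature to the shallow feature's size.
        deep = F.interpolate(deep, size=shallow.shape[2:],
                             mode="trilinear", align_corners=False)
        # Softmax over the branch dimension yields weights summing to 1.
        w = torch.softmax(self.weight_pred(torch.cat([shallow, deep], dim=1)), dim=1)
        return w[:, 0:1] * shallow + w[:, 1:2] * deep

# Example: fuse two adjacent 3D feature maps with 64 channels each.
fuse = AdaptiveFusion(64)
shallow = torch.randn(1, 64, 8, 32, 32)  # (N, C, T, H, W)
deep = torch.randn(1, 64, 8, 16, 16)
out = fuse(shallow, deep)                # -> (1, 64, 8, 32, 32)
```

Unlike plain addition or concatenation, this kind of weighting lets the network decide, at every spatiotemporal location, how much to trust shallow location cues versus deep semantic cues.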