Human action recognition (HAR) technology is currently of significant interest. Traditional HAR methods generally depend on the temporal and spatial information of the video stream; they require massive training datasets and have long response times, failing to simultaneously meet the technical requirements of real-time interaction: high accuracy, low delay, and low computational cost. For instance, a gymnastic action can last as little as 0.2 s, so the full pipeline, from action capture through recognition to the visualization of a three-dimensional character model, must be fast. Only when the response time of the application system is short enough can it guide synchronous training and accurate evaluation. To reduce the dependence on the amount of video data and meet the HAR technical requirements, this paper proposes a three-stream CNN-LSTM (TS-CNN-LSTM) framework combining convolutional and long short-term memory networks. Firstly, color, depth, and skeleton data of the human body collected by Microsoft Kinect are used as input to reduce the sample size. Secondly, heterogeneous convolutional networks are established to reduce computational cost and shorten response time. Experimental results demonstrate the effectiveness of the proposed model on the NTU RGB+D dataset, reaching a best accuracy of 87.28% in the cross-subject mode. Compared with state-of-the-art methods, our method uses 75% of the training sample size, while its time and space complexity are only 67.5% and 73.98% of theirs, respectively. The response time for recognizing one set of actions is improved by 0.90–1.61 s, which is especially valuable for timely action feedback. The proposed method provides an effective solution for real-time interactive applications that require timely human action recognition results and responses.
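The three-stream design described above can be illustrated with a minimal PyTorch sketch. All layer sizes, feature dimensions, and the late-fusion strategy here are assumptions for illustration, not details taken from the paper: per-frame features are extracted from the color, depth, and skeleton streams by separate (heterogeneous) encoders, concatenated, and passed to an LSTM for sequence-level classification.

```python
import torch
import torch.nn as nn

class ThreeStreamCNNLSTM(nn.Module):
    """Hypothetical sketch of a three-stream CNN-LSTM: per-frame features
    from RGB, depth, and skeleton inputs are fused and fed to an LSTM."""

    def __init__(self, num_classes=60, feat_dim=64, num_joints=25):
        super().__init__()
        # Heterogeneous per-stream encoders (sizes are assumed, not from the paper)
        self.rgb_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.skel_mlp = nn.Sequential(
            nn.Linear(num_joints * 3, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(3 * feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, rgb, depth, skel):
        # rgb: (B, T, 3, H, W); depth: (B, T, 1, H, W); skel: (B, T, J*3)
        B, T = rgb.shape[:2]
        f_rgb = self.rgb_cnn(rgb.flatten(0, 1)).view(B, T, -1)
        f_dep = self.depth_cnn(depth.flatten(0, 1)).view(B, T, -1)
        f_skel = self.skel_mlp(skel)
        seq = torch.cat([f_rgb, f_dep, f_skel], dim=-1)  # late fusion
        out, _ = self.lstm(seq)           # temporal modeling
        return self.head(out[:, -1])      # classify from the last timestep

model = ThreeStreamCNNLSTM()
rgb = torch.randn(2, 8, 3, 32, 32)
depth = torch.randn(2, 8, 1, 32, 32)
skel = torch.randn(2, 8, 75)              # 25 joints x 3 coordinates
logits = model(rgb, depth, skel)
print(logits.shape)                       # (2, 60) class logits
```

The 60-class output matches the NTU RGB+D label set; swapping in deeper stream encoders would not change the fusion pattern shown here.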