In this work, we use a hierarchical detector-classifier architecture for the gesture recognition task. During operation, the detector, which essentially serves as a switch for the classifier, runs continuously. When the detector's output is positive, the classifier is activated and returns a classification label for the input video stream. Our work focuses on improving both the detector and the classifier. In the detector, we introduce an attention mechanism that guides the network to focus on the spatial locations and channels where the gesture occurs. For the classifier, in addition to the RGB stream, we use an independent branch to extract features from the depth stream and then fuse the two branches. Because gestures move in three-dimensional space, depth information compensates for what RGB information lacks. Experiments show that on the EgoGesture test set, our detector achieves 98.86% accuracy on RGB input, while the classifier achieves 93.85% accuracy. At the same time, our gesture recognition architecture fully meets real-time requirements.
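To make the control flow concrete, the sketch below illustrates, with assumed PyTorch modules, how the always-running detector gates the heavier classifier and how the RGB and depth branches could be merged. The branch modules, the sigmoid threshold of 0.5, and concatenation-based late fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class TwoStreamClassifier(nn.Module):
    """Two independent branches (RGB and depth) fused before the final head."""

    def __init__(self, rgb_branch: nn.Module, depth_branch: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.rgb_branch = rgb_branch      # assumed to map a clip to (B, feat_dim)
        self.depth_branch = depth_branch  # same output shape, on the depth stream
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Late fusion by concatenation; the paper states only that the two
        # branches are merged, so concatenation here is an assumption.
        fused = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        return self.head(fused)


@torch.no_grad()
def recognize(detector: nn.Module, classifier: TwoStreamClassifier,
              rgb_clip: torch.Tensor, depth_clip: torch.Tensor,
              threshold: float = 0.5):
    """The always-running detector acts as a switch: the classifier is
    evaluated only when the detector reports a gesture in the clip."""
    if torch.sigmoid(detector(rgb_clip)).item() < threshold:
        return None  # no gesture detected: classifier stays inactive
    return classifier(rgb_clip, depth_clip).argmax(dim=1).item()
```

In this scheme, the detector is kept lightweight because it runs on every incoming window, while the two-stream classifier is invoked only on demand, which is what allows the overall pipeline to meet real-time requirements.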