This work aims to help students perceive, study, and create music, and to realize a "human-computer interaction" music teaching mode. A distributed design pattern is adopted to build a gesture interactive robot suitable for music education. First, the client is designed: its gesture acquisition module employs a dual-channel convolutional neural network (DCCNN) for gesture recognition, whose convolutional layer applies convolution kernels of two sizes to the input image. Second, the server is designed; it recognizes the collected gesture instruction data through a two-stream convolutional neural network (CNN), which cuts the gesture instruction data into K segments and sparsely samples each segment into a short sequence. An optical flow algorithm then extracts the optical flow features of each short sequence. Finally, the performance of the robot is tested. The results show that the combination of convolution kernels with sizes of 5x5 and 7x7 achieves a recognition accuracy of 98%, suggesting that the DCCNN can effectively collect gesture command data. After training, the DCCNN's gesture recognition accuracy reaches 90%, which is higher than that of mainstream dynamic gesture recognition algorithms under the same conditions. In addition, the recognition accuracy of the gesture interactive robot is above 90%, suggesting that the robot can meet normal requirements and offers good reliability and stability. It is also recommended for use in music perception teaching, providing a basis for establishing a multi-sensory music teaching model.
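
The dual-kernel design described above can be illustrated with a minimal sketch: two parallel convolution branches, one with 5x5 kernels and one with 7x7 kernels, operate on the same image, and their feature maps are fused before classification. The layer widths, pooling choices, and classifier head below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualChannelCNN(nn.Module):
    """Sketch of a dual-channel CNN: parallel 5x5 and 7x7 convolution
    branches whose feature maps are fused by concatenation.
    Channel counts and the head are illustrative assumptions."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Branch 1: 5x5 kernels capture finer local gesture detail.
        self.branch5 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Branch 2: 7x7 kernels capture broader spatial context.
        self.branch7 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fused features are pooled and classified.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply both kernel sizes to the same image, then fuse.
        f5 = self.branch5(x)
        f7 = self.branch7(x)
        return self.head(torch.cat([f5, f7], dim=1))
```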
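
The server-side preprocessing, segmenting a gesture clip into K parts, sparsely sampling a short sequence from each, and extracting optical flow, can likewise be sketched. The abstract does not name the specific optical flow algorithm, so OpenCV's Farneback method is used here purely as a stand-in; the segment count and frames per segment are hypothetical parameters.

```python
import numpy as np
import cv2

def sparse_sample_segments(frames, k=3, frames_per_segment=2):
    """Split a gesture clip into K segments and draw an evenly spaced
    short frame sequence from each (sketch of the sparse-sampling step)."""
    segments = np.array_split(np.arange(len(frames)), k)
    sampled = []
    for seg in segments:
        idx = np.linspace(0, len(seg) - 1, frames_per_segment).astype(int)
        sampled.append([frames[seg[i]] for i in idx])
    return sampled

def optical_flow_features(short_sequence):
    """Extract dense optical flow between consecutive frames of one
    short sequence. Farneback is an assumed stand-in for the paper's
    unnamed optical flow algorithm."""
    flows = []
    prev = cv2.cvtColor(short_sequence[0], cv2.COLOR_BGR2GRAY)
    for frame in short_sequence[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)  # shape (H, W, 2): per-pixel x/y displacement
        prev = curr
    return flows
```

In a two-stream setup, the raw sampled frames would feed the spatial stream while these flow fields feed the temporal stream, with the two predictions fused at the end.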