Scene recognition is one of the most important tasks in computer vision. Apart from appearance, spatial layout provides a crucial cue for discriminative representation. In this paper, we propose spatial ensemble kernel (SEK) learning, which fuses multi-scale spatial information to achieve a compact yet discriminative representation of scenes. Based on the spatial pyramid, SEK combines the CNN features from each level of the pyramid in an ensemble and fuses them by kernels. Through kernel approximation, we obtain a Fourier feature embedding of the CNN features at each scale, which amounts to a nonlinear layer of a neural network with a cosine activation function. The parameters of this nonlinear layer can be learned jointly in a single optimization framework by supervised learning, which yields compact and discriminative feature representations. We show the effectiveness of the proposed SEK on two recent scene benchmark datasets, i.e., MIT indoor and SUN 397. The proposed SEK achieves high performance on both datasets, competitive with state-of-the-art algorithms.
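To make the Fourier feature embedding concrete, the following is a minimal sketch of a random Fourier feature layer with a cosine activation, applied per pyramid level and then concatenated. It rests on several assumptions that the abstract does not state: a Gaussian kernel, randomly sampled rather than learned parameters, and fusion by concatenation (which corresponds to summing the per-level kernels). The function names (`make_fourier_layer`, `fourier_embed`, `ensemble_embed`) and the bandwidth `sigma` are hypothetical, chosen only for illustration.

```python
import math
import random


def make_fourier_layer(in_dim, out_dim, sigma, rng):
    """Sample (W, b) for a Gaussian-kernel random Fourier feature layer.

    Assumption: parameters are sampled here; in SEK they would instead be
    learned by supervised training.
    """
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(in_dim)]
               for _ in range(out_dim)]
    biases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(out_dim)]
    return weights, biases


def fourier_embed(x, layer):
    """Compute z(x) = sqrt(2/D) * cos(W x + b): linear map, cosine activation."""
    weights, biases = layer
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def ensemble_embed(level_features, layers):
    """Embed each pyramid level with its own layer and concatenate.

    The dot product of two concatenated embeddings approximates the sum of
    the per-level kernels (an assumed fusion rule for this sketch).
    """
    fused = []
    for x, layer in zip(level_features, layers):
        fused.extend(fourier_embed(x, layer))
    return fused


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel, used only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for CNN features of one image region (hypothetical values).
    x = [0.1, -0.2, 0.3, 0.5]
    y = [0.0, 0.1, 0.4, 0.2]
    layer = make_fourier_layer(in_dim=4, out_dim=4000, sigma=1.0, rng=rng)
    approx = sum(a * b for a, b in zip(fourier_embed(x, layer),
                                       fourier_embed(y, layer)))
    print("approximate kernel:", approx)
    print("exact kernel:      ", gaussian_kernel(x, y, 1.0))
```

In this sketch the inner product of two embeddings approximates the kernel value, and the approximation tightens as `out_dim` grows. Replacing the sampled `weights` and `biases` with trainable parameters would give the kind of learnable cosine layer the abstract describes.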