With the rapid development of remote sensing technology, the continuous accumulation of remote sensing time series data provides important data support for studying land cover classification. Extracting discriminative features for classification from remote sensing time series data with deep learning methods has become a hot research topic. Deep learning methods require large amounts of training data, but sample imbalance prevents the commonly used recurrent and convolutional networks from achieving high accuracy in categories with few samples. To address this problem, this paper introduces the self-attention mechanism, which originated in natural language processing, to the classification of multispectral remote sensing time series data with the aim of extracting deep temporal features at a global scale. This mechanism differs from recurrent networks, which extract temporal features from previous time steps along the temporal dimension, and from convolutional networks, which extract temporal features within a local temporal neighborhood. We construct a new feature extraction network based on the transformer encoder, which first employed the self-attention mechanism in natural language processing, and then compare this network with a long short-term memory (LSTM)-based feature extraction network and a temporal convolutional neural network-based feature extraction network to evaluate the effectiveness of the self-attention-based method in improving the classification accuracy of small-sample categories. To ensure a fair comparison, we adopt a generic classification framework consisting of data input, feature extraction network, classifier, and classification output, and we use different models with various hyperparameters as feature extraction networks.
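The global temporal feature extraction described above can be illustrated with scaled dot-product self-attention: each output time step is a weighted combination of all input time steps, unlike a recurrent cell (past steps only) or a temporal convolution (local window only). The following is a minimal dependency-free sketch for illustration, not the paper's transformer-encoder implementation; the function name and plain-list data layout are assumptions.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention over a time series (illustrative sketch).

    x: list of T time steps, each a list of d feature values.
    Every output step attends to ALL steps, so temporal features are
    aggregated at a global scale rather than from a local neighborhood
    or only from past steps.
    """
    T, d = len(x), len(x[0])
    scale = math.sqrt(d)
    out = []
    for i in range(T):
        # similarity of step i against every step j (global, not local)
        scores = [sum(x[i][k] * x[j][k] for k in range(d)) / scale
                  for j in range(T)]
        # softmax normalisation of the attention weights
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        # output step i = attention-weighted sum of all time steps
        out.append([sum(w[j] * x[j][k] for j in range(T))
                    for k in range(d)])
    return out
```

In a full transformer encoder, queries, keys, and values are learned linear projections of the input and multiple attention heads are used; this sketch omits those projections to isolate the global-weighting idea.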
We then evaluate the classification performance of the different methods on the public TiSeLaC multispectral remote sensing time series dataset, using per-class accuracy, overall accuracy (OA), and mean intersection over union (mIoU) as metrics. To obtain a robust measure for each method, we choose the three hyperparameter settings with the highest mIoU for each model and report the averaged metrics as the final result. Results show that the self-attention-based network outperforms both the recurrent and convolutional networks, achieving 92.98% OA and 80.60% mIoU, which are 1.25% and 1.32% higher than those of the recurrent and convolutional networks, respectively. In terms of per-class accuracy, the self-attention-based network matches the recurrent and convolutional networks in the large-sample categories, with differences of less than 0.74%, while improving classification accuracy in the small-sample categories by margins ranging from 2.47% to 5.41%. This paper thus introduces the self-attention mechanism to the classification of multispectral remote sensing time series data to address the low classification accuracy in small-sample categories caused by sample imbalance. We construct a new temporal feature extraction network based on the self-attention mechanism to extract temporal features globally from time series and design a set of objective comparison experiments. Experimental results show that by extracting temporal features globally, instead of relying on previous time steps (as recurrent networks do) or on a local temporal neighborhood (as convolutional networks do), the self-attention-based network achieves the same accuracy in majority-sample categories and effectively improves accuracy in small-sample categories.
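The three metrics above can all be derived from a confusion matrix: per-class accuracy is the recall of each class, OA is the fraction of correctly classified samples, and mIoU averages, over classes, the ratio of true positives to the union of ground-truth and predicted samples of that class. A minimal stdlib-only sketch (the function name and integer-label interface are assumptions, not the paper's evaluation code):

```python
def classification_metrics(y_true, y_pred, n_classes):
    """Per-class accuracy, overall accuracy (OA), and mean IoU (mIoU)
    computed from integer label lists via a confusion matrix."""
    # conf[i][j]: number of samples of true class i predicted as class j
    conf = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1
    per_class_acc, ious = [], []
    for c in range(n_classes):
        tp = conf[c][c]
        row = sum(conf[c])                               # true class c
        col = sum(conf[r][c] for r in range(n_classes))  # predicted c
        per_class_acc.append(tp / row if row else 0.0)
        union = row + col - tp                           # |truth ∪ prediction|
        ious.append(tp / union if union else 0.0)
    oa = sum(conf[c][c] for c in range(n_classes)) / len(y_true)
    miou = sum(ious) / n_classes
    return per_class_acc, oa, miou
```

Because mIoU weights every class equally regardless of its sample count, it is more sensitive than OA to the small-sample categories that motivate this study.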
Therefore, the self-attention-based network can play an important role in the future classification of remote sensing time series, and further research on this network is warranted. © 2023 National Remote Sensing Bulletin