Effective feature description for cross-modal remote sensing matching is challenging due to the complex geometric and radiometric differences between multimodal images. Current Siamese or pseudo-Siamese networks describe features from multimodal remote sensing images directly at the fully connected layer; however, the similarity of cross-modal features is barely considered during feature extraction. Therefore, in this paper we construct a cross-modal feature description matching network (CM-Net) for remote sensing image matching. First, a contextual self-attention module is proposed to add semantic global-dependency information using candidate and non-candidate keypoint patches. Then, a cross-fusion module is designed to obtain cross-modal feature descriptions through information interaction. Finally, a similarity matching loss function is presented to optimize discriminative feature representations, converting the matching task into a classification task. The proposed CM-Net model is evaluated through qualitative and quantitative experiments on four multimodal image datasets, achieving an average Matching Score (M.S.), Mean Matching Accuracy (MMA), and average root-mean-square error (aRMSE) of 0.781, 0.275, and 1.726, respectively. The comparative study demonstrates the superior performance of the proposed CM-Net for remote sensing image matching.
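To make the two central ideas of the abstract concrete, the sketch below shows one plausible form a cross-fusion module (information interaction between the two modalities via cross-attention) and a classification-style similarity matching loss could take. This is a minimal illustration, not the authors' implementation: the class names, feature dimensions, head count, and temperature are assumptions, and the actual CM-Net architecture is described in the body of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossFusionBlock(nn.Module):
    """Hypothetical cross-attention fusion of two modality descriptors.

    Each modality queries the other so that the fused patch features
    carry cross-modal context before matching.
    """

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, num_patches, dim) patch features per modality
        fused_a, _ = self.attn_a(feat_a, feat_b, feat_b)  # modality A queries B
        fused_b, _ = self.attn_b(feat_b, feat_a, feat_a)  # modality B queries A
        return self.norm_a(feat_a + fused_a), self.norm_b(feat_b + fused_b)


def similarity_matching_loss(desc_a: torch.Tensor, desc_b: torch.Tensor) -> torch.Tensor:
    """Treat matching as classification: for each descriptor in A, the
    "correct class" is its corresponding descriptor in B (diagonal pairs)."""
    # Cosine-similarity logits over all pairs, scaled by an assumed temperature.
    logits = F.normalize(desc_a, dim=-1) @ F.normalize(desc_b, dim=-1).T / 0.07
    targets = torch.arange(desc_a.size(0), device=desc_a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    block = CrossFusionBlock()
    optical = torch.randn(8, 16, 128)  # 8 keypoint patches, 16 tokens, 128-dim
    sar = torch.randn(8, 16, 128)
    fused_opt, fused_sar = block(optical, sar)
    loss = similarity_matching_loss(fused_opt.mean(dim=1), fused_sar.mean(dim=1))
    print(loss.item())
```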