Remote sensing images pose formidable classification challenges due to their complex spatial organization, high inter-class similarity, and significant intra-class variability. To address the trade-off between computational efficiency and feature extraction capability in existing methods, this paper proposes STConvNeXt, a lightweight convolutional network. Architecturally, the model incorporates a split-based mobile convolution module with a hierarchical tree structure: it employs parameterized depthwise separable convolutions to reduce computational complexity and constructs a multi-level feature tree to enable cross-scale feature fusion. For feature enhancement, a fast pyramid pooling module replaces the traditional spatial pyramid structure, reducing the parameter count while preserving large-scale contextual awareness. For the training strategy, a dynamic threshold loss function with a learnable inter-class margin is introduced to improve the model's ability to distinguish hard-to-classify samples. Systematic experiments on the UCMerced, AID, and NWPU-RESISC45 benchmark datasets validate the effectiveness of the proposed approach: compared with the ConvNeXt baseline, STConvNeXt reduces the parameter count by 56.49% and FLOPs by 49.89% while improving classification accuracy by 1.2–2.7%. STConvNeXt also compares favorably with current state-of-the-art remote sensing scene classification models. Ablation studies confirm the contribution of each module and show, in particular, that the model maintains high classification accuracy despite the substantial reduction in parameters.
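The abstract does not spell out the fast pyramid pooling module; as a point of reference, the PyTorch sketch below shows one common way such "fast" pooling is realized (in the spirit of SPPF-style blocks), where chained max-pools reuse intermediate results to emulate the multiple receptive fields of a parallel spatial pyramid. The class name `FastPyramidPooling`, the bottleneck width, and the 5×5 pool size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FastPyramidPooling(nn.Module):
    """Sketch of a fast pyramid pooling block (assumed SPPF-style design):
    three chained 5x5 max-pools emulate the 5/9/13 receptive fields of a
    parallel spatial pyramid while defining only a single pooling op."""

    def __init__(self, in_channels: int, out_channels: int, pool_size: int = 5):
        super().__init__()
        hidden = in_channels // 2  # bottleneck to keep parameters low (assumption)
        self.reduce = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.fuse = nn.Conv2d(hidden * 4, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        p1 = self.pool(x)    # effective 5x5 receptive field
        p2 = self.pool(p1)   # effective 9x9 (pooling the pooled map)
        p3 = self.pool(p2)   # effective 13x13
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```

Reusing one pooling operator sequentially, rather than running several kernels in parallel, is what yields the parameter and FLOP savings claimed for this style of module: the spatial size is preserved throughout, and only the two 1×1 convolutions carry learnable weights.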
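Likewise, the abstract describes a dynamic threshold loss with a learnable inter-class margin but gives no formula. A minimal sketch of one plausible realization is shown below, assuming a learnable per-class margin that is subtracted from the target logit before cross-entropy, which forces hard classes to be separated by a larger, data-driven gap. The class name `DynamicMarginLoss`, the per-class parameterization, and the softplus constraint are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicMarginLoss(nn.Module):
    """Sketch of a dynamic-threshold loss (assumed form): a learnable
    per-class margin is subtracted from the target logit before the
    standard cross-entropy, so the required decision gap adapts during
    training instead of being a fixed hyperparameter."""

    def __init__(self, num_classes: int, init_margin: float = 0.1):
        super().__init__()
        # one learnable margin per class, updated by the optimizer
        self.margin = nn.Parameter(torch.full((num_classes,), init_margin))

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        m = F.softplus(self.margin)  # keep the margin non-negative
        adjusted = logits.clone()
        rows = torch.arange(logits.size(0), device=logits.device)
        # penalize the target logit by its class margin: the model must
        # exceed competing classes by at least m[target] to drive loss down
        adjusted[rows, target] = logits[rows, target] - m[target]
        return F.cross_entropy(adjusted, target)
```

Because the margin is an `nn.Parameter`, the criterion's parameters must be passed to the optimizer alongside the model's (e.g., `optim.AdamW(list(model.parameters()) + list(criterion.parameters()))`) for the threshold to actually be learned.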