Object detection in aerial remote sensing images is a fundamental task in the Earth observation community. However, several challenges remain in this field, including the varied appearance of the targets to be detected, the complexity of image backgrounds, and the high cost of manual annotation. To tackle these problems, we propose a Faster R-CNN based framework with several elaborate designs. Our detector incorporates a bidirectional enhancement feature pyramid network into the framework, which improves multi-scale feature extraction so that objects of different sizes are handled effectively. In addition, an attention module is introduced to further suppress noisy background. Moreover, we augment the training sets using a count-guided deep descriptor transforming (CG-DDT) algorithm, which automatically generates coarse object bounding boxes for images annotated only with class labels and per-class object counts. We have evaluated the proposed method on popular aerial remote sensing benchmarks, i.e., NWPU VHR-10 and DOTA, and the experimental results show that it can accurately detect targets while reducing the cost of manual annotation during training. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023.
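The abstract does not spell out how CG-DDT turns a class label and an object count into coarse boxes, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a deep-descriptor-transforming step has already produced a 2-D indicator map in which positive values mark likely object locations, and it assumes that "count-guided" means keeping as many connected regions as the annotated per-class count. The function name `boxes_from_indicator` and the choice of 4-connectivity are hypothetical.

```python
from collections import deque


def boxes_from_indicator(indicator, count):
    """Return up to `count` coarse boxes as (row_min, col_min, row_max, col_max).

    `indicator` is a list of rows of floats; cells with a value > 0 are
    treated as object evidence (an assumption of this sketch).
    Boxes are ordered from the largest connected region to the smallest.
    """
    rows = len(indicator)
    cols = len(indicator[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []  # list of (area, box)

    for r in range(rows):
        for c in range(cols):
            if indicator[r][c] <= 0 or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected positive cells.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and indicator[ny][nx] > 0):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((area, (r0, c0, r1, c1)))

    # Assumed count guidance: keep only as many regions as the object count.
    regions.sort(key=lambda item: item[0], reverse=True)
    return [box for _, box in regions[:max(count, 0)]]
```

Under these assumptions, an image labelled with a count of two objects for a class would contribute the bounding boxes of its two largest positive regions as coarse pseudo-annotations, and smaller positive regions would be discarded as probable background responses.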