Object detection based on visible images adapts poorly to complex lighting conditions such as low light, no light, and strong light, while object detection based on infrared images is strongly affected by background noise. Infrared objects also lack color information and have weak texture features, which poses a further challenge. To address these problems, a dual-modal object detection approach is proposed that effectively fuses the features of visible and infrared images. A multiscale feature attention module is proposed to extract multiscale features from the input IR and RGB images separately. Meanwhile, channel attention and spatial pixel attention are introduced to focus the multiscale feature information of the dual-modal images along both the channel and pixel dimensions. Finally, a dual-modal feature fusion module is proposed to adaptively fuse the feature information of the two modalities. On the large-scale dual-modal image dataset DroneVehicle, compared with the baseline algorithm YOLOv5s using visible-only or infrared-only single-modal detection, the proposed algorithm improves detection accuracy by 13.42 and 2.27 percentage points, respectively, and reaches a detection speed of 164 frames/s, giving ultra-real-time end-to-end detection capability. The proposed algorithm effectively improves the robustness and accuracy of object detection in complex scenes and has good application prospects. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
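The pipeline summarized above (per-modality channel and spatial attention, then adaptive dual-modal fusion) can be sketched in a minimal NumPy form. This is an illustrative assumption, not the paper's implementation: the mean-pooled sigmoid gates and the softmax modality weights are placeholders for the learned layers a real network would use, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Global average pooling yields one gate per channel,
    # which re-weights channels (a learned MLP would replace the raw mean).
    gate = sigmoid(feat.mean(axis=(1, 2)))            # shape (C,)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    # Channel-wise mean yields one gate per pixel, re-weighting spatial
    # positions (a learned convolution would replace the raw mean).
    gate = sigmoid(feat.mean(axis=0))                 # shape (H, W)
    return feat * gate[None, :, :]

def fuse_dual_modal(rgb_feat, ir_feat):
    # Apply channel then spatial attention to each modality, then fuse
    # with softmax weights over each modality's mean attention response.
    rgb_a = spatial_attention(channel_attention(rgb_feat))
    ir_a = spatial_attention(channel_attention(ir_feat))
    w = np.exp([rgb_a.mean(), ir_a.mean()])
    w /= w.sum()                                      # adaptive modality weights
    return w[0] * rgb_a + w[1] * ir_a

# Usage: fuse two 8-channel 4x4 feature maps.
rgb = np.random.rand(8, 4, 4)
ir = np.random.rand(8, 4, 4)
fused = fuse_dual_modal(rgb, ir)                      # shape (8, 4, 4)
```

The fused map keeps the shape of either input, so it can feed an unmodified single-stream detection head downstream.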