In recent years, research on multimedia event extraction has attracted increasing attention. However, due to the lack of large-scale annotated datasets, most existing studies rely on weakly supervised methods that use different datasets for training and testing, which inevitably exposes event extraction to distribution differences and noise across datasets. Meanwhile, although multimodal fusion can effectively model the correlation and complementarity between modalities, the fusion process may also introduce additional noise that degrades extraction results. To address these problems, we propose a multimedia event extraction method based on a multimodal low-dimensional feature representation space (MLDFR), which focuses on mitigating noise interference during multimodal fusion. On the one hand, MLDFR combines contrastive learning and momentum distillation to construct a low-dimensional feature representation space, which strengthens the model's ability to match text and images in that space and effectively reduces the interference of dataset noise on multimodal information fusion. On the other hand, during visual event extraction, MLDFR not only fuses the corresponding textual events as additional features, but also generates image descriptions with a generative model and integrates them into the extraction process as further complementary features to better model inter-modal correlations. Experimental results on the benchmark dataset show that the proposed MLDFR method significantly improves the performance of multimedia event extraction.
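To make the first component more concrete, below is a minimal sketch (not the authors' implementation) of how contrastive learning with momentum distillation can be combined in a shared low-dimensional representation space: image and text features are projected and aligned with an image-text contrastive loss, while an EMA (momentum) copy of the projection head supplies soft similarity targets that soften noisy pairings. All module names, dimensions, the temperature, and the mixing weight alpha are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """Projects modality-specific features into a shared low-dimensional space."""
    def __init__(self, in_dim=768, proj_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(in_dim, proj_dim)
        self.txt_proj = nn.Linear(in_dim, proj_dim)

    def forward(self, img_feat, txt_feat):
        z_i = F.normalize(self.img_proj(img_feat), dim=-1)
        z_t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_i, z_t

def contrastive_with_momentum_distillation(head, momentum_head,
                                           img_feat, txt_feat,
                                           temperature=0.07, alpha=0.4):
    """Image-text contrastive loss whose hard one-hot targets are blended with
    soft targets produced by an EMA (momentum) copy of the projection head."""
    z_i, z_t = head(img_feat, txt_feat)
    logits_i2t = z_i @ z_t.t() / temperature
    logits_t2i = z_t @ z_i.t() / temperature

    with torch.no_grad():  # momentum teacher provides soft, noise-tolerant targets
        zm_i, zm_t = momentum_head(img_feat, txt_feat)
        soft_i2t = F.softmax(zm_i @ zm_t.t() / temperature, dim=-1)
        soft_t2i = F.softmax(zm_t @ zm_i.t() / temperature, dim=-1)

    hard = torch.eye(img_feat.size(0), device=img_feat.device)
    tgt_i2t = (1 - alpha) * hard + alpha * soft_i2t
    tgt_t2i = (1 - alpha) * hard + alpha * soft_t2i

    loss_i2t = -(tgt_i2t * F.log_softmax(logits_i2t, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(logits_t2i, dim=-1)).sum(dim=-1).mean()
    return (loss_i2t + loss_t2i) / 2

@torch.no_grad()
def ema_update(momentum_head, head, m=0.995):
    """Momentum (EMA) update of the teacher parameters."""
    for p_m, p in zip(momentum_head.parameters(), head.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

if __name__ == "__main__":
    head = ContrastiveHead()
    momentum_head = copy.deepcopy(head)
    for p in momentum_head.parameters():
        p.requires_grad_(False)

    img_feat = torch.randn(8, 768)   # stand-ins for image/text encoder outputs
    txt_feat = torch.randn(8, 768)
    loss = contrastive_with_momentum_distillation(head, momentum_head, img_feat, txt_feat)
    loss.backward()
    ema_update(momentum_head, head)
    print(f"contrastive + distillation loss: {loss.item():.4f}")
```

The blended targets keep in-batch negatives from dominating when an image and a caption from a weakly labeled pair are only loosely related, which is the noise-mitigation effect the abstract attributes to the low-dimensional representation space.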