Background

With the increasing integration of Artificial Intelligence (AI) and the Internet of Things (IoT), the dissemination of multimodal data is undergoing revolutionary changes. To mitigate the societal risks posed by the rapid spread of malicious multimodal data, such as hateful memes, it is crucial to develop effective detection methods. Existing detection models often struggle with data quality issues and lack interpretability, limiting their effectiveness in content moderation tasks.

Aims

This paper proposes an explainable hateful meme detection model based on uncertainty-aware dynamic fusion. The goal is to improve both generalization performance and interpretability, addressing the limitations of conventional static fusion methods and existing hateful meme detection algorithms.

Materials & Methods

Existing algorithms for hateful meme detection frequently overlook data quality and model interpretability. To address these challenges, this paper proposes Hate-UDF, an explainable hateful meme detection model with uncertainty-aware dynamic fusion that provides both high generalization ability and interpretability. The method dynamically evaluates the uncertainty of each modality, derives dynamic weights from these estimates, and uses the weights to combine the modality features for fusion, yielding an uncertainty-aware dynamic fusion method with a provable upper bound on generalization error. Furthermore, analyzing the dynamic weights reveals which modality the model primarily relies on for detection, making the method both explainable and reliable.

Results

We compare Hate-UDF with three general-purpose models and three state-of-the-art (SOTA) models for hateful meme detection on the Facebook Hateful Memes (FHM) and Multimedia Automatic Misogyny Identification (MAMI) datasets. Hate-UDF achieves state-of-the-art performance, surpassing existing models on both datasets: relative to the current SOTA model, it improves accuracy and AUC by 7.56% and 2.8% on FHM and by 3.34% and 0.17% on MAMI, respectively. Additionally, we show that the visual modality is more important than the textual modality in hateful meme detection and use visualization to explain the primary reason behind this.

Discussion

The model dynamically adapts to modality quality, enhancing reliability and reducing the risk of misclassification. Its interpretability, achieved through visualizations of modality and feature attributions, provides valuable insights for content moderation systems and highlights the importance of the image modality in detecting hateful memes. While Hate-UDF provides an explainable and reliable method for detecting hateful memes, it may still learn biases from the training data, potentially leading to over-detection of content from certain groups or communities. Future research should focus on improving the fairness and ethical accountability of the model's decisions.
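To make the fusion mechanism concrete, the following is a minimal sketch of uncertainty-aware dynamic fusion in PyTorch. It uses the entropy of each modality's unimodal prediction as the uncertainty estimate, which is one common proxy; the paper's exact estimator and architecture may differ, and all module and variable names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UncertaintyAwareFusion(nn.Module):
    """Sketch of uncertainty-aware dynamic fusion for two modalities.

    Hypothetical layer names; entropy of per-modality predictions is
    used as an assumed stand-in for the paper's uncertainty estimator.
    """

    def __init__(self, text_dim: int, image_dim: int,
                 fused_dim: int, num_classes: int = 2):
        super().__init__()
        # Project each modality into a shared feature space.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        # Per-modality heads used only to estimate predictive uncertainty.
        self.text_head = nn.Linear(fused_dim, num_classes)
        self.image_head = nn.Linear(fused_dim, num_classes)
        # Classifier over the fused representation.
        self.classifier = nn.Linear(fused_dim, num_classes)

    @staticmethod
    def _entropy(logits: torch.Tensor) -> torch.Tensor:
        # Predictive entropy: high entropy = high uncertainty.
        return -(F.softmax(logits, dim=-1)
                 * F.log_softmax(logits, dim=-1)).sum(dim=-1)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor):
        t = self.text_proj(text_feat)
        v = self.image_proj(image_feat)

        # Estimate each modality's uncertainty from its own prediction.
        u_text = self._entropy(self.text_head(t))
        u_image = self._entropy(self.image_head(v))

        # Lower uncertainty -> higher weight; normalize across modalities.
        weights = F.softmax(torch.stack([-u_text, -u_image], dim=-1), dim=-1)
        w_t, w_v = weights[..., 0:1], weights[..., 1:2]

        # Weight the features and fuse; the weights themselves are the
        # interpretable signal (which modality the decision leans on).
        fused = w_t * t + w_v * v
        return self.classifier(fused), weights
```

Averaging the returned weights over a validation set indicates which modality the detector relies on; this is the kind of analysis that supports a claim such as "the visual modality is more important than the textual modality."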
Conclusion

This paper introduces Hate-UDF, an uncertainty-based dynamic fusion method designed to address the multimodal fusion issues of existing hateful meme detection models. The model assesses the uncertainty of each modality to determine the reliability of its information and generates dynamic weights accordingly. By comparing these weights, the model can identify which modality is most influential in detecting malicious content. Hate-UDF is therefore interpretable, and its generalization performance has been validated empirically.