With the rapid growth of the Internet, multimodal data have become abundant online. To better understand users' feelings, multimodal sentiment analysis and sarcasm detection have become popular research topics. However, previous studies did not take noisy data into account when designing their models. In this paper, we design a novel architecture and further introduce a momentum distillation method to improve the model's robustness to noisy data. Specifically, we propose the Transformer-Based Network with Momentum Distillation (TBNMD). For the model architecture, we first encode each modality to obtain hidden representations. A multimodal interaction module then produces text-guided image features and image-guided text features, and a multimodal fusion module combines them into fused features. Momentum distillation is a self-distillation method: during training, a teacher model generates semantically similar samples as additional supervision for the student model. Experimental results on five publicly available datasets demonstrate the effectiveness of our method.
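To make the two ingredients named above more concrete, the sketch below illustrates, under assumptions of our own, (a) cross-modal attention that yields text-guided image features and image-guided text features followed by a simple fusion step, and (b) momentum (EMA) self-distillation, where a slowly updated teacher provides soft targets for the student. This is not the authors' implementation; all module names, dimensions, and hyperparameters (e.g. `dim=256`, `momentum=0.995`, `alpha`, `tau`) are illustrative.

```python
# Minimal sketch of cross-modal interaction/fusion and momentum self-distillation.
# Encoders are placeholders; a real system would feed BERT / ViT features here.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalBlock(nn.Module):
    """Query one modality with the other via multi-head cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        # e.g. query = text tokens, context = image patches -> image-guided text features
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)


class StudentModel(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.text_to_image = CrossModalBlock(dim)   # text-guided image features
        self.image_to_text = CrossModalBlock(dim)   # image-guided text features
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats, image_feats):
        img_guided_text = self.image_to_text(text_feats, image_feats)
        txt_guided_img = self.text_to_image(image_feats, text_feats)
        # Simple fusion: mean-pool each stream and concatenate
        fused = torch.cat([img_guided_text.mean(1), txt_guided_img.mean(1)], dim=-1)
        return self.classifier(fused)


def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.995):
    """Teacher parameters track an exponential moving average of the student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)


def distillation_step(student, teacher, text_feats, image_feats, labels,
                      alpha: float = 0.4, tau: float = 1.0):
    """Hard-label cross-entropy plus a KL term toward the teacher's soft targets."""
    logits = student(text_feats, image_feats)
    with torch.no_grad():
        soft_targets = F.softmax(teacher(text_feats, image_feats) / tau, dim=-1)
    ce = F.cross_entropy(logits, labels)
    kl = F.kl_div(F.log_softmax(logits / tau, dim=-1), soft_targets,
                  reduction="batchmean")
    return (1 - alpha) * ce + alpha * kl


if __name__ == "__main__":
    student = StudentModel()
    teacher = copy.deepcopy(student)           # teacher starts as a copy of the student
    for p in teacher.parameters():
        p.requires_grad_(False)

    text = torch.randn(8, 20, 256)             # (batch, text tokens, dim)
    image = torch.randn(8, 49, 256)            # (batch, image patches, dim)
    labels = torch.randint(0, 2, (8,))

    loss = distillation_step(student, teacher, text, image, labels)
    loss.backward()
    ema_update(teacher, student)               # teacher follows the student via EMA
```

The soft targets from the EMA teacher act as the "semantically similar samples" mentioned above: they smooth the supervision signal, so a mislabeled or noisy example does not force the student toward a single hard label.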