The rapid growth of the Internet and of multimedia information has created a need for multimedia information retrieval techniques, especially for image retrieval. Two main trends can be distinguished. The first, called "text-based image retrieval", applies text-retrieval techniques to fully annotated images. The text describes high-level concepts, but this technique has drawbacks: annotation is tedious, and annotations can be ambiguous because two users may describe the same image with different keywords. Consequently, some approaches use WordNet to reduce these potential ambiguities. The second, called "content-based image retrieval", is a younger field. These methods rely on visual features (color, texture or shape) that are computed automatically, and they retrieve images using a similarity measure. However, their performance is not really acceptable, except on well-focused corpora. One way to improve recognition is to combine visual and semantic information.
In many vision problems, it is easier to obtain annotations for only a subset of the data than to obtain fully annotated training data, because this is less demanding for the user. This paper deals with modeling, classifying, and annotating weakly annotated images. More precisely, we propose a scheme that optimizes image classification by means of a joint visual-text clustering approach and that automatically extends image annotations. The proposed approach is derived from probabilistic graphical model theory and is dedicated to both the classification and the annotation of weakly annotated images. We consider an image to be weakly annotated if the number of keywords defined for it is less than the maximum defined in the ground truth. Because probabilistic graphical models can handle missing values, such a model is proposed to represent weakly annotated images.
We propose a probabilistic graphical model based on a Gaussian-mixture and multinomial mixture. The visual features are modeled by the Gaussian mixtures and the keywords by a multinomial distribution. As a result, the proposed model does not require all images to be annotated: when an image is weakly annotated, the missing keywords are treated as missing values. In addition, our model can automatically extend existing annotations to weakly annotated images, without user intervention. The uncertainty in the association between a set of keywords and an image is handled by a joint probability distribution (defined from the Gaussian-mixture and multinomial mixture) over the dictionary of keywords and the visual features extracted from our collection of images. Moreover, to address the dimensionality problem caused by the high dimensionality of the visual features, we have adapted a variable selection method. Results of visual-textual classification, reported on a database of images collected from the Web and partially, manually annotated, show an improvement of about 32.3% in recognition rate over classification based on visual information alone. Furthermore, automatic annotation extension with our model for images with missing keywords improves on the visual-textual classification by about 6.8%. Finally, the proposed method is experimentally competitive with state-of-the-art classifiers.
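To illustrate how missing keywords can be treated as missing values, the following Python sketch scores an image under a per-class Gaussian mixture over visual features combined with a per-class keyword distribution. It is only a toy illustration and not the model evaluated here: the parameters are hand-set rather than learned, the diagonal covariances and the independent keyword slots are simplifying assumptions, the variable selection step is omitted, and all names (`VOCAB`, `CLASSES`, `class_posterior`, `extend_annotation`) are hypothetical.

```python
import math

# Hypothetical toy parameters; in practice they would be learned from data.
VOCAB = ["sky", "sea", "tree", "grass"]
CLASSES = {
    "beach": {
        "prior": 0.5,
        # Gaussian mixture over 2-D visual features: (weight, means, variances)
        "gmm": [(0.6, [0.2, 0.8], [0.05, 0.05]), (0.4, [0.3, 0.6], [0.05, 0.05])],
        # Class-specific keyword probabilities (sum to 1 over VOCAB)
        "kw": {"sky": 0.4, "sea": 0.5, "tree": 0.05, "grass": 0.05},
    },
    "forest": {
        "prior": 0.5,
        "gmm": [(0.5, [0.7, 0.3], [0.05, 0.05]), (0.5, [0.8, 0.2], [0.05, 0.05])],
        "kw": {"sky": 0.15, "sea": 0.05, "tree": 0.5, "grass": 0.3},
    },
}


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_gmm(x, components):
    """Log density of a Gaussian mixture evaluated at x."""
    return log_sum_exp(
        [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    )


def class_posterior(x, keywords):
    """p(class | visual features, observed keywords); None marks a missing keyword."""
    scores = {}
    for name, params in CLASSES.items():
        score = math.log(params["prior"]) + log_gmm(x, params["gmm"])
        for kw in keywords:
            if kw is not None:  # a missing keyword is marginalised out (factor 1)
                score += math.log(params["kw"][kw])
        scores[name] = score
    norm = log_sum_exp(list(scores.values()))
    return {name: math.exp(s - norm) for name, s in scores.items()}


def extend_annotation(x, keywords):
    """Fill missing keyword slots with the most probable unused keywords."""
    posterior = class_posterior(x, keywords)
    used = {kw for kw in keywords if kw is not None}
    # Rank unused keywords by their probability averaged over the class posterior.
    ranked = sorted(
        (w for w in VOCAB if w not in used),
        key=lambda w: sum(posterior[c] * CLASSES[c]["kw"][w] for c in CLASSES),
        reverse=True,
    )
    candidates = iter(ranked)
    return [kw if kw is not None else next(candidates) for kw in keywords]


features = [0.25, 0.75]   # toy visual feature vector
weak = ["sea", None]      # weakly annotated: one of two keyword slots is missing
posterior = class_posterior(features, weak)
print(max(posterior, key=posterior.get))   # prints "beach"
print(extend_annotation(features, weak))   # prints ['sea', 'sky']
```

In this sketch a missing keyword simply contributes no factor to the class score, which is what marginalising over an unobserved variable amounts to under the independence assumption made here. The same class posterior then serves both tasks: taking its maximum classifies the image, and averaging the keyword distributions under it ranks candidate keywords for annotation extension.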