Obtaining large-scale, labeled crop disease data is difficult and costly in agriculture, so learning useful features from small amounts of unlabeled data has become an urgent problem. Self-supervised contrastive learning and self-supervised masked image modeling can both train without labels, but each paradigm has its own advantages and drawbacks. Moreover, features learned from a single modality are limited, since correlations with information from other modalities are ignored. Hence, this paper introduces MMSSL, an effective multimodal self-supervised learning framework for identifying cucumber diseases from small sample sizes. By integrating image self-supervised masked learning, image self-supervised contrastive learning, and multimodal image-text contrastive learning, the model not only learns disease features from different modalities but also captures both global and local disease information. In addition, the masked-learning branch is enhanced with a prompt learning module based on a cross-attention network, which coarsely localizes the masked image regions in advance and helps the decoder make accurate reconstruction predictions. Experimental results demonstrate that the proposed method achieves 95% accuracy in cucumber disease identification without labels, effectively uncovering high-level semantic features in multimodal, small-sample cucumber disease data. Grad-CAM is also employed for visual analysis to further examine the model's decision-making process in disease identification. In conclusion, the proposed method improves classification accuracy on small-sample cucumber data in a multimodal, unlabeled setting and demonstrates good generalization performance.
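To make the three-branch objective concrete, the following is a minimal sketch of how such a combined loss could be assembled, assuming standard formulations: a CLIP-style image-text InfoNCE loss, a SimCLR-style contrastive loss between two augmented image views, and an MAE-style masked-patch reconstruction loss. The function names, loss weights (w_itc, w_icl, w_mim), and tensor layouts below are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of an MMSSL-style combined training objective.
# Assumes pre-computed embeddings/predictions from the three branches.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings, shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def mmssl_loss(img_emb, txt_emb,            # image/text embeddings, (N, D)
               view1_emb, view2_emb,        # two augmented-view embeddings, (N, D)
               pred_patches, target_patches,  # decoder output / pixel targets, (N, P, D)
               mask,                        # 1.0 for masked patches, (N, P)
               w_itc=1.0, w_icl=1.0, w_mim=1.0):
    # Branch 1: multimodal image-text contrastive loss.
    l_itc = info_nce(img_emb, txt_emb)
    # Branch 2: image self-supervised contrastive loss between the two views.
    l_icl = info_nce(view1_emb, view2_emb)
    # Branch 3: masked-image reconstruction, averaged over masked patches only.
    l_mim = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    l_mim = (l_mim * mask).sum() / mask.sum().clamp(min=1)
    return w_itc * l_itc + w_icl * l_icl + w_mim * l_mim
```

Summing the branch losses with tunable weights is one common way to balance global semantic alignment (the two contrastive terms) against local detail recovery (the reconstruction term); the weighting scheme here is a placeholder assumption.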