This paper presents an end-to-end, scalable, and flexible framework for multimodal multimedia information retrieval (MMIR). The framework is designed to handle the multiple data modalities, such as visual, audio, and text, that are frequently encountered in real-world applications. By integrating these different data types, it enables a more holistic understanding of information and thus improves the accuracy and reliability of retrieval tasks. A key strength of the framework is its ability to learn semantic relationships within and between modalities through deep neural networks trained on query-hit pairs generated from query logs. A major innovation of the approach lies in the efficient handling of multimodal data uncertainty through an improved fuzzy clustering technique. In addition, the search process is refined through triplet-loss Siamese networks for reranking, together with a novel fusion approach that combines the ranks of different retrieval systems using the ordered weighted averaging (OWA) operator. The framework leverages parallel processing and transfer learning for efficient feature extraction across modalities, significantly improving scalability and adaptability. Performance has been rigorously evaluated on six widely recognized multimodal datasets. The results indicate that the integrated approach, which combines clustering-based ranking, a triplet-loss Siamese network for reranking, OWA-based fusion, and the alternative adaptive fuzzy c-means (AAFCM) method for soft clustering, consistently outperforms the previous configurations reported in the literature. Our experimental results, supported by extensive statistical analysis, confirm the effectiveness and robustness of the approach for MMIR.
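For concreteness, the sketch below illustrates the general idea of OWA-based rank fusion referred to above: each document's per-system ranks are turned into scores, the scores are sorted in descending order, and a weight vector is applied to the sorted values. The function names, the rank-to-score conversion, and the example weight vector are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np

def owa(values, weights):
    """Ordered weighted averaging: weights are applied to the
    values after sorting them in descending order."""
    values = np.sort(np.asarray(values, dtype=float))[::-1]
    weights = np.asarray(weights, dtype=float)
    assert values.shape == weights.shape and np.isclose(weights.sum(), 1.0)
    return float(np.dot(weights, values))

def fuse_rankings(rank_lists, weights):
    """Fuse the ranked lists of several retrieval systems.

    rank_lists : one dict per retrieval system, mapping doc_id -> rank (1 = best).
    weights    : OWA weight vector, one weight per system, summing to 1.
    Returns doc_ids sorted by fused score, best first.
    """
    all_docs = set().union(*rank_lists)
    n_docs = max(len(r) for r in rank_lists)
    fused = {}
    for doc in all_docs:
        # Convert each system's rank to a score in (0, 1]; a document
        # missing from a list gets 0 (an illustrative choice).
        scores = [1.0 - (r[doc] - 1) / n_docs if doc in r else 0.0
                  for r in rank_lists]
        fused[doc] = owa(scores, weights)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical example: three retrieval systems ranking image documents,
# with OWA weights that emphasize the systems' strongest agreements.
system_a = {"img_12": 1, "img_07": 2, "img_33": 3}
system_b = {"img_07": 1, "img_12": 2, "img_90": 3}
system_c = {"img_12": 1, "img_90": 2, "img_07": 3}
print(fuse_rankings([system_a, system_b, system_c], weights=[0.5, 0.3, 0.2]))
```

Because the weights act on the sorted scores rather than on fixed systems, the OWA operator can interpolate between optimistic (max-like) and conservative (min-like) fusion behavior simply by changing the weight vector.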