Multimodal Entity and Relation Extraction (MERE) encompasses tasks such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE), which aim to extract valuable information from environments rich in multimodal data. Existing approaches still face several challenges: insufficient use of emotional information in multimodal data, mismatches between textual and visual content, ambiguous meanings, and difficulty achieving precise alignment across different semantic levels. To address these issues, we propose the Hierarchical Generation of Multi Evidence Alignment Fusion Model for Multimodal Entity and Relation Extraction (HGMAF). The model comprises a hierarchical diffusion semantic generation stage and a multi-evidence alignment fusion module. In the first stage, we design different prompt templates for the original text and use a Large Language Model (LLM) to generate the corresponding hierarchical textual content. The generated hierarchical content is then passed through a diffusion process to obtain images with rich hierarchical semantic information. This stage enhances the model's understanding of hierarchical information in the original content. We then design the multi-evidence alignment fusion module, which combines the generated textual and image evidence and fully leverages information from different sources to improve extraction accuracy. Experimental results show that our model achieves F1 scores of 76.29 %, 87.66 %, and 87.34 % on the Twitter2015, Twitter2017, and MNRE datasets, respectively, surpassing the previous state-of-the-art models by 0.29 %, 0.1 %, and 2.77 %. Our model also demonstrates superior performance in low-resource scenarios, further confirming its effectiveness. The code is available at https://github.com/lsx314/HGMAF.
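
The two-stage flow described above can be illustrated with a minimal sketch. It is not the HGMAF implementation: the prompt templates, the level names, the `Evidence` container, and the functions `generate_evidence` and `fuse` are hypothetical, the LLM and the image generator are passed in as plain callables, and the similarity-weighted average only stands in for the multi-evidence alignment fusion module, whose actual design is not given here.

```python
from dataclasses import dataclass
from math import exp, sqrt
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Hypothetical prompt templates, one per semantic level. The real templates
# and the number of levels used by the authors are not specified here.
PROMPT_TEMPLATES: Dict[str, str] = {
    "entity": "List the entities mentioned in: {text}",
    "scene": "Describe the scene implied by: {text}",
}


@dataclass
class Evidence:
    level: str            # which template produced this evidence
    generated_text: str   # LLM output for that template
    image: object         # whatever the image generator returns


def generate_evidence(
    text: str,
    llm: Callable[[str], str],
    text_to_image: Callable[[str], object],
) -> List[Evidence]:
    """Stage 1 (sketch): one generated text and one generated image per level."""
    evidence = []
    for level, template in PROMPT_TEMPLATES.items():
        generated = llm(template.format(text=text))
        evidence.append(Evidence(level, generated, text_to_image(generated)))
    return evidence


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(query: Vector, evidence_vectors: Sequence[Vector]) -> Vector:
    """Stage 2 (stand-in): softmax-weighted average of evidence vectors,
    weighted by their cosine similarity to the query vector."""
    if not evidence_vectors:
        return list(query)
    scores = [cosine(query, v) for v in evidence_vectors]
    top = max(scores)  # subtract the max for numerical stability
    weights = [exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * v[i] for w, v in zip(weights, evidence_vectors))
        for i in range(len(query))
    ]
```

In use, `llm` and `text_to_image` would wrap whichever language model and diffusion model are available, and the vectors passed to `fuse` would come from text and image encoders. Evidence that aligns more closely with the original input receives a larger weight, which is the intuition behind alignment-based fusion, though the paper's module may work differently.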