With rapid industrialization and urbanization, air pollution has become a global environmental problem, and PM2.5 has attracted particular attention because of its severe effects on human health and the environment. Accurate prediction of PM2.5 concentration is therefore crucial for environmental protection and public health, but the nonlinear, multivariate nature of PM2.5 data makes accurate prediction difficult. To address this, we propose WTCrossformer, a hybrid multivariate prediction model that integrates wavelet transform convolution (WTC) to better extract local features and to reduce the impact of noise on predictions. The model also employs dimension-segment-wise (DSW) embedding and a two-stage attention (TSA) mechanism to capture temporal and cross-variable correlations in multivariate PM2.5 data, and uses a hierarchical encoder-decoder structure to generate predictions. We evaluate the model on a multivariate time-series dataset from the UCI Machine Learning Repository, which contains 13 variables describing air pollutant concentrations and meteorological conditions at 12 monitoring stations in the Beijing area over a 5-year period. Comparative experiments on multiple PM2.5 datasets indicate that the model achieves relatively high prediction accuracy and accurately tracks trends in PM2.5 concentration, offering practical guidance for people’s daily life and health. Ablation experiments further confirm that introducing the WTC module significantly improves prediction accuracy. Our work provides technical support for environmental monitoring and pollution prediction.
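As a rough illustration of the DSW embedding step named in the abstract, the sketch below splits each variable's series into fixed-length segments and maps every segment to a vector with one shared linear projection. It is a minimal pure-Python sketch, not the model's implementation: the function names `split_segments` and `dsw_embed`, the parameters `seg_len` and `d_model`, and the random weights standing in for a learned projection are all assumptions made for illustration.

```python
import random


def split_segments(values, seg_len):
    """Split one variable's series into consecutive, non-overlapping segments."""
    if len(values) % seg_len != 0:
        raise ValueError("series length must be a multiple of seg_len")
    return [values[i:i + seg_len] for i in range(0, len(values), seg_len)]


def dsw_embed(series, seg_len, d_model, seed=0):
    """Embed a multivariate series segment by segment, one variable at a time.

    series: list of T rows, each row holding D variable values.
    Returns a nested list of shape [D][T // seg_len][d_model].
    """
    num_vars = len(series[0])
    rng = random.Random(seed)
    # Assumption: random weights stand in for a learned linear projection
    # that is shared across all variables and all segments.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(seg_len)]
               for _ in range(d_model)]
    bias = [0.0] * d_model

    embedded = []
    for d in range(num_vars):
        column = [row[d] for row in series]  # series of variable d
        vectors = []
        for seg in split_segments(column, seg_len):
            # Linear projection of one segment: weights @ seg + bias
            vectors.append([
                sum(w * x for w, x in zip(w_row, seg)) + b
                for w_row, b in zip(weights, bias)
            ])
        embedded.append(vectors)
    return embedded


if __name__ == "__main__":
    # Toy input: 24 time steps, 3 variables (hypothetical values).
    toy = [[float(t + d) for d in range(3)] for t in range(24)]
    emb = dsw_embed(toy, seg_len=6, d_model=8)
    print(len(emb), len(emb[0]), len(emb[0][0]))
```

The result is a two-dimensional grid of vectors indexed by variable and by time segment. Attention can then be applied along the segment axis for temporal correlations and along the variable axis for cross-variable correlations, which is the role the abstract assigns to the TSA mechanism.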