Precise monitoring of film-mulched winter wheat growth facilitates field management optimization and can further improve yield. Unmanned aerial vehicles (UAVs) are an effective tool for crop monitoring at the field scale. However, because soil and mulch introduce background interference, accurately monitoring crop growth against complex backgrounds with UAVs remains a challenge. In addition, the simultaneous inversion of multiple growth parameters enables comprehensive monitoring of overall crop growth status. This study conducted field experiments with three winter wheat mulching treatments: ridge mulching, ridge–furrow full-mulching, and flat cropping full-mulching. Three machine learning algorithms (partial least squares, ridge regression, and support vector machines) and a deep neural network were employed to process the vegetation index (VI) feature data, and the residual neural network 50 (ResNet 50) was used to process the image data. The two modalities (VI feature data and image data) were then fused to obtain a multi-modal fusion (MMF) model. In addition, a film-mulched winter wheat growth monitoring model that simultaneously predicts leaf area index (LAI), aboveground biomass (AGB), plant height (PH), and leaf chlorophyll content (LCC) was constructed by coupling the MMF model with multi-task learning. The results showed that the image-based ResNet 50 outperformed the VI feature-based models. The MMF model improved prediction accuracy for LAI, AGB, PH, and LCC, with coefficients of determination of 0.73–0.92, mean absolute errors of 0.29–3.89, and relative root mean square errors of 9.48–12.99%. A multi-task MMF model with equal loss weights ([1/4, 1/4, 1/4, 1/4]) achieved accuracy comparable to that of the single-task MMF models while improving training efficiency and generalizing well to different film-mulched sample areas.
The multi-task MMF model proposed in this study provides an accurate and comprehensive method for monitoring the growth status of film-mulched winter wheat.
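As an illustration of the equal loss weighting and the three reported accuracy metrics, the following minimal sketch may be useful. It is not the implementation used in this study. It assumes a mean squared error loss per task, which the abstract does not specify, and it assumes that the relative root mean square error is the RMSE divided by the mean of the observed values, expressed as a percentage. All function names are hypothetical.

```python
import math

# The four growth parameters predicted jointly by the multi-task model.
TASKS = ("LAI", "AGB", "PH", "LCC")


def mse(y_true, y_pred):
    """Mean squared error (assumed here as the per-task loss)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def multi_task_loss(targets, preds, weights=None):
    """Weighted sum of per-task losses; defaults to equal weights of 1/4."""
    if weights is None:
        weights = {task: 1.0 / len(TASKS) for task in TASKS}
    return sum(weights[task] * mse(targets[task], preds[task]) for task in TASKS)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target variable."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rrmse(y_true, y_pred):
    """RMSE as a percentage of the observed mean (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    return 100.0 * math.sqrt(mse(y_true, y_pred)) / mean
```

Because the four parameters have different units and ranges, a practical implementation would likely standardize each target before combining the losses; otherwise the task with the largest numeric scale dominates the equally weighted sum.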