An accurate understanding of noise perception is important for urban planning, noise management, and public health. However, the visual and acoustic urban landscapes are intrinsically linked: the intricate interplay between what we see and what we hear shapes noise perception in the urban environment. To measure this complex, mixed effect, we conducted a mobility-based survey in Hong Kong with 800 participants, recording their noise exposure, noise perception, and GPS trajectories. In addition, we acquired Google Street View images associated with each GPS trajectory point and extracted the urban visual environment from them. This study used a multi-sensory framework combining XGBoost with Shapley additive explanations (SHAP) to construct an interpretable classification model for noise perception. Compared to relying solely on sound pressure levels, our model exhibited significant improvements in predicting noise perception, achieving an accuracy of approximately 0.75 on a six-class classification task. Our findings revealed that the most influential factors affecting noise perception are the sound pressure level, the proportions of buildings, plants, and sky in the visual scene, and light intensity. Furthermore, we discovered non-linear relationships between visual factors and noise perception: an excessive proportion of buildings exacerbated noise annoyance and stress levels while simultaneously diminishing objective noise perception. Conversely, the presence of green plants mitigated the effect of noise on stress levels, but beyond a certain threshold, additional greenery worsened both objective noise perception and noise annoyance. Our study provides insight into the objective and subjective dimensions of urban noise perception, which contributes to advancing our understanding of complex and dynamic urban environments.
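The interpretable classification pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses scikit-learn's gradient boosting and permutation importance as widely available stand-ins for XGBoost and SHAP, and the feature names (`spl_dba`, `building_ratio`, etc.) and synthetic data are assumptions for demonstration only.

```python
# Hedged sketch of a multi-sensory noise-perception classifier.
# scikit-learn's GradientBoostingClassifier and permutation_importance
# approximate the XGBoost + SHAP pipeline; all data here is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Assumed multi-sensory features: sound pressure level (dBA), visual
# proportions from street-view segmentation, and light intensity.
X = np.column_stack([
    rng.uniform(45, 85, n),   # spl_dba
    rng.uniform(0, 0.7, n),   # building_ratio
    rng.uniform(0, 0.5, n),   # plant_ratio
    rng.uniform(0, 0.4, n),   # sky_ratio
    rng.uniform(0, 1.0, n),   # light_intensity
])

# Six synthetic perception classes, driven mainly by SPL with visual
# modifiers, loosely mimicking the non-linear effects in the abstract.
score = (X[:, 0] - 45) / 40 + 0.2 * X[:, 1] - 0.15 * X[:, 2]
y = np.clip((score * 6).astype(int), 0, 5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = model.score(X_te, y_te)

# Rank feature influence on held-out data (stand-in for SHAP values).
imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
names = ["spl_dba", "building_ratio", "plant_ratio",
         "sky_ratio", "light_intensity"]
ranked = sorted(zip(names, imp.importances_mean), key=lambda t: -t[1])

print(f"6-class accuracy: {acc:.2f}")
for name, val in ranked:
    print(f"{name}: {val:.3f}")
```

On this synthetic data the sound pressure level dominates the importance ranking, consistent with the abstract's finding that SPL is the strongest single predictor, with visual features contributing secondary, non-linear effects.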