Fine-scale population spatialization is a frontier topic in the geosciences and is essential for sustainable urban planning and effective resource allocation. Various approaches have been proposed to improve population estimation accuracy using multi-source geospatial data. However, approaches based on remote sensing data usually suffer from spatial homogeneity, while social-sensing-based approaches, such as those using point-of-interest (POI) data, cannot distinguish the population distribution around POIs of the same category but different scales. This study therefore proposes a novel method that combines street view imagery (SVI) with POI data to enrich the semantic landscape of street-level objects and to provide a visual representation of spatial heterogeneity within the urban environment. Specifically, we extract POI and SVI features at the grid, street, and community levels, and then select modelling features based on a cross-scale analysis of their consistency with population. Grid-level SVI features are then adjusted using community-level SVI features to alleviate their sparsity and transiency. Finally, we train a random forest (RF) model at the street level and use it to estimate grid-level population weights for population allocation. Experiments in Wuhan City at a grid size of 100 m × 100 m show that our method achieves higher accuracy than the WorldPop and GPW datasets, Ye's method, and heterogeneous population attraction of POI modelling (HPA-POI), demonstrating its effectiveness for fine-scale population spatialization.
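
The final allocation step can be illustrated with a minimal sketch. It assumes that each grid already has a non-negative population weight (for example, one predicted by the RF model) and that the population of each street is known. Each street's population is then distributed across its grids in proportion to their weights. The function name `allocate_population`, the dictionary layout, the even-split fallback for zero weights, and the toy values are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def allocate_population(grid_weights, grid_to_street, street_population):
    """Split each street's population across its grids in proportion to grid weight.

    All names and the data layout are illustrative assumptions.
    Weights are assumed to be non-negative.
    """
    # Accumulate the total weight and the number of grids per street.
    weight_sum = defaultdict(float)
    grid_count = defaultdict(int)
    for grid_id, weight in grid_weights.items():
        street = grid_to_street[grid_id]
        weight_sum[street] += weight
        grid_count[street] += 1

    allocated = {}
    for grid_id, weight in grid_weights.items():
        street = grid_to_street[grid_id]
        if weight_sum[street] > 0:
            share = weight / weight_sum[street]
        else:
            # Assumption: if every weight in a street is zero, split evenly.
            share = 1.0 / grid_count[street]
        allocated[grid_id] = street_population[street] * share
    return allocated


# Hypothetical toy data: four grids in two streets.
grid_weights = {"g1": 3.0, "g2": 1.0, "g3": 2.0, "g4": 2.0}
grid_to_street = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
street_population = {"A": 4000, "B": 1000}

result = allocate_population(grid_weights, grid_to_street, street_population)
```

Because the shares within each street sum to one, the grid-level estimates add back up to the street totals, which is the usual constraint in this kind of top-down allocation.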