Travel time prediction plays an important role in the overall control of urban Intelligent Transportation Systems (ITS). Urban arterial networks are composed of links and intersections, each of which can be regarded as a spatial node within the network. However, existing research predominantly models only the spatial nodes of the link modality when predicting travel times in urban arterial networks, neglecting the potential correlations among nodes of heterogeneous modalities. To overcome this limitation, we propose a Heterogeneous Multi-Modal Graph Neural Network (HMGNN) tailored for travel time prediction in arterial networks. First, we construct spatial correlation graphs that capture the distinct traffic characteristics of intersection-modality nodes. Second, we design a cross-modal graph generator that captures the latent spatiotemporal features between spatial nodes of different modalities and uses them to generate heterogeneous modal graphs. Finally, HMGNN applies network structures tailored to graphs of varying complexity, enabling targeted mining of the information in each graph to derive the final prediction. Extensive experiments on real-world traffic data from Zhangzhou, China, demonstrate that HMGNN achieves significant improvements in prediction accuracy.
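As a rough illustration of the node and edge types described above, the following sketch builds three edge sets for a toy arterial network: link-to-link, intersection-to-intersection, and cross-modal link-to-intersection. It is not the HMGNN graph generator, which learns latent spatiotemporal correlations; it only derives static, topology-based graphs. The network `LINKS`, the function `build_heterogeneous_graph`, and the dictionary keys are all hypothetical names introduced for this example.

```python
# Hypothetical toy network: (link name, upstream intersection, downstream intersection).
LINKS = [
    ("l0", "A", "B"),
    ("l1", "B", "C"),
    ("l2", "B", "D"),
]


def build_heterogeneous_graph(links):
    """Derive topology-based edge sets for link and intersection nodes.

    This is an assumed, simplified construction for illustration only;
    it does not learn any correlations from traffic data.
    """
    link_ids = [name for name, _, _ in links]
    inter_ids = sorted({node for _, up, down in links for node in (up, down)})

    link_link = set()    # (a, b): link a ends at the intersection where link b starts
    inter_inter = set()  # (u, v): intersections joined by a directed link u -> v
    cross = set()        # (link, intersection): the link touches that intersection

    for name, up, down in links:
        inter_inter.add((up, down))
        cross.add((name, up))
        cross.add((name, down))
        for other, other_up, _ in links:
            if other != name and other_up == down:
                link_link.add((name, other))

    return {
        "link_ids": link_ids,
        "inter_ids": inter_ids,
        "link_link": link_link,
        "inter_inter": inter_inter,
        "cross": cross,
    }


graph = build_heterogeneous_graph(LINKS)
for key in ("link_link", "inter_inter", "cross"):
    print(key, sorted(graph[key]))
```

In this toy case, link `l0` feeds both `l1` and `l2` through intersection `B`, so `B` appears in the cross-modal edges of all three links. A model restricted to the link modality would see only the `link_link` edges and would have no explicit node representing `B`.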