Software vulnerabilities are a major concern in cybersecurity, as even small flaws can expose systems to serious risks. Most automated detection methods focus on a single source, usually source code, which limits their ability to capture the diverse ways vulnerabilities manifest and leaves complementary information underused. To address this limitation, this work investigates how heterogeneous modalities contribute to vulnerability detection and proposes a multimodal deep learning framework that integrates code semantics, commit messages, static metrics, and syntactic structures through an attention-based fusion mechanism. To ensure deployability in real-world scenarios where commit information is often unavailable at inference time, a teacher–student knowledge distillation scheme is employed: the multimodal teacher, trained on all modalities, transfers its knowledge to a lightweight student model that relies solely on code features. Experiments on multiple vulnerability datasets show that multimodal fusion significantly improves detection performance and that knowledge distillation preserves much of this gain while enabling practical deployment. Our findings highlight the importance of cross-modal integration and distillation for building robust and scalable vulnerability detection systems.
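For illustration, the sketch below shows one way the attention-based fusion and the teacher–student distillation objective could be realized in PyTorch. It is a minimal sketch under assumed names and dimensions (the modality layout, `AttentionFusion`, `CodeOnlyStudent`, and the `kd_loss` helper are illustrative assumptions), not the paper's released implementation.

```python
# Minimal sketch (assumptions, not the authors' code): attention-weighted fusion of
# per-modality embeddings for the teacher, and a temperature-scaled distillation loss
# for the code-only student.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Scores each modality embedding, softmax-normalizes the scores, and sums."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, n_modalities, dim), e.g. code, commit message, metrics, syntax
        weights = torch.softmax(self.score(embeddings), dim=1)   # (batch, n_modalities, 1)
        return (weights * embeddings).sum(dim=1)                 # (batch, dim)

class MultimodalTeacher(nn.Module):
    """Fuses all modality embeddings and predicts vulnerable / not vulnerable."""
    def __init__(self, dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.fusion = AttentionFusion(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        return self.head(self.fusion(modality_embeddings))

class CodeOnlyStudent(nn.Module):
    """Lightweight student that sees only the code embedding at inference time."""
    def __init__(self, dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(dim, n_classes)

    def forward(self, code_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(code_embedding)

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Cross-entropy on labels plus KL divergence to the frozen teacher's soft targets."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl

# Usage (hypothetical tensors): distill the student from the teacher's logits.
# teacher_logits = teacher(modality_embeddings)        # all four modalities available
# student_logits = student(code_embedding)             # code features only
# loss = kd_loss(student_logits, teacher_logits.detach(), labels)
```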