The composite faults in train traction motor bearings are diverse and often unknown. Current approaches heavily rely on extensive training datasets to guarantee the dependability of diagnostic outcomes. However, obtaining training samples of unknown composite faults in train bearings under real-world conditions is challenging. To tackle this problem, this study introduces a zero-shot diagnostic framework that utilizes acoustic signals captured by voiceprint sensors for diagnosing unknown composite faults. The approach builds upon a model that generates fault features from label feature vectors (LFV), enabling the diagnosis of unknown composite faults based on knowledge of single faults. First, a feature extraction approach using a spatially enhanced convolutional neural network is designed, introducing a spatial attention mechanism to strengthen the model’s attention to critical aspects of the features. Subsequently, a definition method for LFV is established to map the relationship between the extracted features and the LFV. A Wasserstein-generating adversarial network with a second-order dynamic gradient penalty is then proposed to generate virtual features based on the LFV. The designed second-order dynamic gradient penalty function helps the model explore the parameter space more efficiently and find the optimal solution, reducing the discrepancy between generated and real features. Finally, two independent acoustic datasets verified the model’s robustness. Without training on composite fault data, the model achieved a classification accuracy of 69.84% for four types of unknown composite faults in bearings, surpassing other methods for bearing composite fault diagnosis.