Medical Visual Question Answering (Med-VQA) is a critical multimodal task with the potential to alleviate the scarcity and imbalance of medical resources. However, most existing studies overlook two limitations that keep Med-VQA an open challenge: the inconsistency in information density between medical images and text, and the long-tail distribution in datasets. To address these issues, this study proposes a Language-Guided Progressive Fusion Network (LGPFN) with three key modules: Question-Guided Progressive Multimodal Fusion (QPMF), a Language-Gate Mechanism (LGM), and Triple Semantic Feature Alignment (TriSFA). QPMF progressively guides the fusion of visual and textual features using both global and local question representations. LGM, a linguistic rule-based module, distinguishes Closed-Ended (CE) from Open-Ended (OE) samples and directs the fused features to the appropriate classifier. Finally, TriSFA captures the rich semantic information of OE answers and mines the underlying associations among fused features, predicted answers, and ground truths, aligning them in a ternary semantic feature space. The proposed LGPFN framework outperforms existing state-of-the-art models, achieving the best overall accuracies of 80.39%, 84.07%, 75.74%, and 70.60% on the VQA-RAD, SLAKE, PathVQA, and VQA-Med 2019 datasets, respectively. These results demonstrate the effectiveness and generalizability of the proposed model and underscore its potential as a medical Artificial Intelligence (AI) agent that could contribute to universal health coverage.
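To make the gating idea concrete, the following is a minimal sketch, not the paper's actual implementation, of how a linguistic rule-based gate might separate CE from OE questions and route fused features to the corresponding classifier. The rule list, the function names, and the two toy classifier heads are illustrative assumptions introduced here; the paper's LGM may rely on different linguistic rules.

```python
# Illustrative sketch of a linguistic rule-based language gate (assumption:
# the concrete rules and interfaces below are NOT taken from the paper).

CLOSED_ENDED_STARTS = (
    "is", "are", "was", "were", "does", "do", "did",
    "can", "could", "has", "have", "will", "would", "should",
)

def is_closed_ended(question: str) -> bool:
    """Heuristic rule: a question opening with an auxiliary/copular verb is
    treated as Closed-Ended (typically yes/no); anything else as Open-Ended."""
    words = question.strip().lower().split()
    return bool(words) and words[0] in CLOSED_ENDED_STARTS

def language_gate(question: str, fused_features, ce_classifier, oe_classifier):
    """Route the fused multimodal features to the CE or OE answer classifier."""
    if is_closed_ended(question):
        return ce_classifier(fused_features)
    return oe_classifier(fused_features)

if __name__ == "__main__":
    # Toy stand-ins for the CE/OE answer heads.
    ce_head = lambda feats: "closed-ended head"
    oe_head = lambda feats: "open-ended head"
    print(language_gate("Is there a fracture in the left femur?", None, ce_head, oe_head))
    print(language_gate("What abnormality is seen in the chest X-ray?", None, ce_head, oe_head))
```

In this sketch the gate is purely rule-based and adds no trainable parameters, which matches the abstract's description of LGM as a linguistic rule-based routing step placed between the fusion module and the two answer classifiers.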