Traditional control approaches rely heavily on hard-coded expert knowledge, which complicates the development of optimal control solutions as system complexity increases. Deep Reinforcement Learning (DRL) offers a self-learning control solution that is advantageous in scenarios where crafting expert-based solutions becomes impractical. This study investigates the potential of DRL for supervisory control of a unique and complex heating system in a large-scale university building. The DRL framework aims to minimize energy costs while ensuring occupant comfort. However, the trial-and-error learning approach of DRL raises concerns about the trustworthiness of the executed actions, hindering practical implementation. To address this, the study incorporates action masking, which enables hard constraints to be integrated into DRL and thereby enhances user trust. Maskable Proximal Policy Optimization (MPPO) is evaluated alongside standard Proximal Policy Optimization (PPO) and Soft Actor–Critic (SAC). Simulation results reveal that MPPO achieves energy savings comparable to the other methods (8% relative to the baseline control) with fewer comfort violations. It is therefore selected from among the candidate algorithms and implemented experimentally in the university building over one week. Experimental findings demonstrate that MPPO reduces energy costs while maintaining occupant comfort, yielding a 36% saving compared to a historical day with similar weather conditions. These results underscore the proactive decision-making capability of DRL and establish its viability for autonomous control of complex energy systems.
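The action-masking mechanism referenced above can be illustrated with a minimal sketch. In maskable policy-gradient methods such as MPPO, actions that would violate a hard constraint have their policy logits set to negative infinity before the softmax, so they receive zero probability and can never be sampled. The function name, the four-action setup, and the specific mask below are illustrative assumptions, not details from the study.

```python
import numpy as np

def masked_action_distribution(logits, mask):
    """Zero out invalid actions by setting their logits to -inf
    before the softmax, the core idea behind maskable PPO.
    `mask` is True for actions permitted by the hard constraints."""
    masked = np.where(mask, logits, -np.inf)
    shifted = masked - masked.max()  # shift for numerical stability
    exp = np.exp(shifted)            # exp(-inf) = 0: masked actions vanish
    return exp / exp.sum()

# Hypothetical example: 4 discrete supervisory actions, two of which
# are masked out (e.g. they would breach a comfort constraint).
logits = np.array([1.0, 2.0, 0.5, 3.0])
mask = np.array([True, False, True, False])
probs = masked_action_distribution(logits, mask)
```

Because the masked actions carry exactly zero probability, the agent explores only within the constraint set during training and deployment, which is what allows hard constraints to coexist with trial-and-error learning.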