Infrared (IR) thermography combined with Unmanned Aerial Vehicles (UAVs) offers an innovative approach to the automated inspection of building façades. However, extracting quantitative defect information from a single image remains a significant challenge. To address this, this paper introduces a Weakly-aligned Cross-modal Learning framework for subsurface defect segmentation using UAVs. The framework consists of two main components: the Multimodal Feature Description Network (MFDN) and the Prompt-aided Cross-modal Graph Learning (PCGL) algorithm. First, RGB–IR image pairs are processed by MFDN to extract feature descriptors for multi-modal alignment. The PCGL algorithm then identifies visually critical areas through graph partitioning on a Wasserstein graph. These critical areas are transferred to the aligned IR image, and a Wasserstein Adjacency Graph (WAG) is constructed based on masked superpixel segmentation. Finally, the defect contours are pinpointed by detecting abnormal vertices of the WAG. The effectiveness of the framework is validated through controlled laboratory experiments and field applications on tiled façades.
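To make the last step more concrete, the following is a minimal sketch of how abnormal vertices might be detected on an adjacency graph whose edges carry Wasserstein distances. It is not the PCGL algorithm itself, and it makes several assumptions of its own: square blocks stand in for superpixels, edge weights are the 1-D Wasserstein-1 distance between the temperature samples of neighbouring blocks, and a vertex is flagged when the median distance to its neighbours exceeds a fixed threshold. All function names, the block size, and the threshold are hypothetical.

```python
from statistics import median


def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples (sorted-sample formula)."""
    if len(a) != len(b):
        raise ValueError("samples must have equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def block_samples(image, block):
    """Split a 2-D list into square blocks (an assumed stand-in for superpixels)."""
    rows, cols = len(image) // block, len(image[0]) // block
    samples = {}
    for r in range(rows):
        for c in range(cols):
            samples[(r, c)] = [
                image[r * block + i][c * block + j]
                for i in range(block)
                for j in range(block)
            ]
    return samples


def build_graph(samples):
    """Connect 4-neighbouring blocks; weight each edge by the W1 distance."""
    graph = {v: {} for v in samples}
    for (r, c) in samples:
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in samples:
                w = wasserstein_1d(samples[(r, c)], samples[nb])
                graph[(r, c)][nb] = w
                graph[nb][(r, c)] = w
    return graph


def abnormal_vertices(graph, threshold):
    """Flag vertices whose median distance to their neighbours exceeds threshold."""
    return sorted(
        v for v, nbrs in graph.items()
        if nbrs and median(nbrs.values()) > threshold
    )


# Synthetic 6x6 "thermal image": 20.0 background with a warm 2x2 patch at 25.0.
image = [[20.0] * 6 for _ in range(6)]
for i in (2, 3):
    for j in (2, 3):
        image[i][j] = 25.0

graph = build_graph(block_samples(image, block=2))
hot = abnormal_vertices(graph, threshold=1.0)
print(hot)
```

The median rule is chosen here only because it separates a block that differs from most of its neighbours from a block that merely borders one anomalous region. A real pipeline would presumably use true superpixels, unequal sample sizes, and a data-driven anomaly criterion.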