As laser scanning welding technology matures in engineering applications, developing diagnostics capable of monitoring weld joint formation is crucial to meeting the demands of increasingly structurally complex products. In this work, a unique multivariate time-series dataset encompassing keyhole and molten pool image streams was extracted from the collected visual signals. The keyhole and molten pool streams were fed into the respective branches of a proposed two-branch Transformer-based model, which incorporates multi-head self-attention and cross-attention mechanisms. The results show that the optimal architecture achieved an accuracy of 99.3%, outperforming previous state-of-the-art image-based models. The optimization and ablation experiments also verified that the temporal characteristics of the signals are a significant determining factor in the accuracy of laser scanning welding state recognition. The attention score maps produced during the decision-making process demonstrate that the proposed model accurately learns the time-series characteristics of the keyhole and molten pool visual signals and captures fine-grained details of highly dynamic objects under varying welding states. In summary, the model's strong performance and the interpretability offered by its attention visualizations make it a promising diagnostic module and a novel strategy for monitoring joint formation in laser scanning welding.
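To illustrate how a two-branch design can combine self-attention and cross-attention over two time series, the following is a minimal sketch and not the authors' architecture. It assumes each frame has already been reduced to a small feature vector, uses a single attention head, and omits learned projections, positional encodings, and the classification head. The feature values and the names `keyhole`, `molten_pool`, and `attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, score_map)."""
    dim = len(keys[0])
    score_map = []
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        score_map.append(softmax(logits))
    outputs = [
        [sum(w * v[j] for w, v in zip(weights, values))
         for j in range(len(values[0]))]
        for weights in score_map
    ]
    return outputs, score_map


# Hypothetical per-frame features: 4 time steps x 3 features per stream.
keyhole = [[0.1, 0.4, 0.2], [0.3, 0.1, 0.5], [0.6, 0.2, 0.1], [0.2, 0.7, 0.3]]
molten_pool = [[0.5, 0.1, 0.3], [0.2, 0.6, 0.1], [0.4, 0.4, 0.2], [0.1, 0.2, 0.8]]

# Self-attention inside each branch models temporal context within one stream.
keyhole_ctx, _ = attention(keyhole, keyhole, keyhole)
pool_ctx, _ = attention(molten_pool, molten_pool, molten_pool)

# Cross-attention: keyhole time steps query the molten pool time steps.
fused, cross_scores = attention(keyhole_ctx, pool_ctx, pool_ctx)
```

Each row of `cross_scores` is a probability distribution over molten pool time steps for one keyhole time step; a matrix of this kind is what an attention score map visualizes.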