The stability of power systems is paramount to industrial operations. False data injection attacks (FDIAs) have drawn substantial interest because of the severe threat they pose to power grids. Contemporary detection systems face numerous challenges, as attackers employ various tactics such as injecting complex elements into measurement data and rapidly formulating attack strategies against critical nodes and transmission lines in the grid topology. Conventional models often fail to adapt to practical conditions because they focus predominantly on detecting individual components. To address these problems, this paper proposes DSE-BiLSTM, a lightweight detection model that integrates depthwise separable convolutional layers, squeeze-and-excitation networks, and a bidirectional long short-term memory architecture. Network topological features are extracted by a variational graph attention autoencoder (VGAAE), which uses a graph convolutional network (GCN) layer to learn each node’s topological features and a graph attention (GAT) module to identify and extract the topological features of critical nodes. The topology information obtained by both techniques is then embedded in a one-dimensional vector space in the same form as the measurement data. Combining the VGAAE output with meter measurements fuses the temporal and spatial modalities. With optimal hyperparameters, DSE-BiLSTM achieves an F1-score of 99.56% and a row accuracy (RACC) of 93.10% on the conventional dataset. Experiments on FDIA detection with composite datasets of the IEEE 14-bus and IEEE 118-bus systems show that the F1-score and RACC of DSE-BiLSTM remain above 84.51% and 83.56%, respectively, under various attack strengths and noise levels. 
In addition, as the power grid network scales up, the effect of noise level on detection performance decreases, while the effect of attack strength on recognition capability increases. DSE-BiLSTM can effectively process composite multimodal spatiotemporal data and provides a feasible solution for the localization and detection of FDIA in realistic scenarios.
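
The two reported metrics can be illustrated with a minimal sketch. It assumes, as is common in locational FDIA detection, that each sample carries one binary label per bus (1 = attacked), that RACC counts a sample as correct only when its entire label row matches, and that the F1-score is micro-averaged over all bus labels. These definitions, the function names `row_accuracy` and `micro_f1`, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def row_accuracy(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose whole label row is predicted exactly.

    Assumed definition of RACC: one wrong bus label makes the row wrong.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def micro_f1(y_true: Sequence[Sequence[int]],
             y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over every (sample, bus) label (assumed averaging)."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 4 samples, 3 buses each.
Y_TRUE = [[0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
Y_PRED = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1]]

if __name__ == "__main__":
    print("RACC:", row_accuracy(Y_TRUE, Y_PRED))
    print("F1:  ", micro_f1(Y_TRUE, Y_PRED))
```

Under these assumptions RACC is the stricter metric, since a single mislocated bus invalidates the whole sample. This is consistent with the RACC values reported above being lower than the corresponding F1-scores.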