Individual trajectories, which capture significant human-environment interactions across space and time, serve as vital inputs for geospatial foundation models (GeoFMs). However, existing approaches to learning trajectory representations often encode spatial-temporal relationships implicitly, which makes it difficult to learn and represent spatial-temporal patterns accurately. This paper therefore proposes a joint spatial-temporal graph representation learning method (ST-GraphRL) that formalizes structurally explicit yet learnable spatial-temporal dependencies into trajectory representations. The proposed ST-GraphRL consists of three components: (i) a weighted directed spatial-temporal graph that explicitly models mobility interactions across the space and time dimensions; (ii) a two-stage joint encoder (i.e., decoupling and fusion) that learns entangled spatial-temporal dependencies by independently decomposing and then jointly aggregating spatial and temporal features; and (iii) a decoder that guides ST-GraphRL to learn both the regularities and the randomness of mobility by simulating the spatial-temporal joint distributions of trajectories. Tested on three real-world human mobility datasets, ST-GraphRL outperformed all baseline models in predicting the spatial-temporal distributions of movements and in preserving trajectory similarity with high spatial-temporal correlations. An analysis of the spatial-temporal features in the latent space further confirms that ST-GraphRL effectively captures underlying mobility patterns. These results may also offer insights into representation learning for other geospatial data toward general-purpose data representations, advancing the progress of GeoFMs.
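To make the two-stage (decoupling-then-fusion) encoder concrete, the sketch below illustrates the idea in PyTorch. It is a minimal illustration, not the authors' implementation: the class name `DecoupleFusionEncoder`, the use of plain linear layers as stand-ins for the paper's graph-based encoders, and all tensor dimensions are assumptions made for the example.

```python
# Minimal sketch of a decoupling-then-fusion encoder, assuming PyTorch.
# All module names, layer choices, and shapes are illustrative assumptions;
# the paper's actual encoders operate on a weighted directed
# spatial-temporal graph rather than the plain MLP branches used here.
import torch
import torch.nn as nn


class DecoupleFusionEncoder(nn.Module):
    """Stage 1 (decoupling): encode spatial and temporal features independently.
    Stage 2 (fusion): jointly aggregate the two views into one embedding."""

    def __init__(self, spatial_dim: int, temporal_dim: int, hidden_dim: int):
        super().__init__()
        # Decoupling: separate branches for the space and time dimensions.
        self.spatial_branch = nn.Sequential(
            nn.Linear(spatial_dim, hidden_dim), nn.ReLU()
        )
        self.temporal_branch = nn.Sequential(
            nn.Linear(temporal_dim, hidden_dim), nn.ReLU()
        )
        # Fusion: aggregate the decoupled representations jointly.
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, spatial_feats: torch.Tensor,
                temporal_feats: torch.Tensor) -> torch.Tensor:
        h_s = self.spatial_branch(spatial_feats)    # (batch, length, hidden)
        h_t = self.temporal_branch(temporal_feats)  # (batch, length, hidden)
        # Concatenate the decoupled views, then fuse into a joint embedding.
        return self.fusion(torch.cat([h_s, h_t], dim=-1))


# Usage: a batch of 8 trajectories with 50 points each; 2-D locations and
# 4-D time encodings are illustrative choices, not values from the paper.
encoder = DecoupleFusionEncoder(spatial_dim=2, temporal_dim=4, hidden_dim=64)
embedding = encoder(torch.randn(8, 50, 2), torch.randn(8, 50, 4))
print(embedding.shape)  # torch.Size([8, 50, 64])
```

The design point the sketch conveys is the separation of concerns: each branch can specialize in one dimension before the fusion layer models their entanglement, rather than forcing a single encoder to learn the coupled spatial-temporal structure implicitly.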