Human action recognition (HAR) remains a challenging task in computer vision because nuanced human movements are difficult to capture from video data, and various algorithms have been developed to address it. This study proposes a novel two-stream architecture that combines long short-term memory (LSTM) with a depthwise separable convolutional neural network (DSConV) and skeleton information, with the aim of improving HAR accuracy. The 3D coordinates of each skeleton joint are extracted using the MediaPipe library, and the 2D coordinates are obtained using MoveNet. The two streams, called the temporal LSTM module and the joint-motion module, were designed to overcome the limitations of prior two-stream RNN models, such as the vanishing gradient problem and the difficulty of effectively extracting spatiotemporal information. A performance evaluation on the benchmark datasets JHMDB (73.31%), Florence-3D Action (97.67%), SBU Interaction (95.2%), and Penn Action (94.0%) demonstrates the effectiveness of the proposed model, and a comparison with state-of-the-art methods shows that the approach performs better on these datasets. This study contributes to advancing the field of HAR, with potential applications in surveillance and robotics.
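
As a rough illustration of why a depthwise separable convolution is a lightweight building block, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 x 1 pointwise convolution. The function names and the layer sizes (3 x 3 kernel, 64 input channels, 128 output channels) are assumptions chosen for illustration only; they are not taken from the proposed model, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not taken from the study.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

Under these assumed sizes the separable layer uses roughly an eighth of the weights of the standard layer; the saving grows with the kernel size and the number of output channels.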