Emotion recognition plays a key role in human–computer interaction. Classifying and predicting human emotions from electroencephalogram (EEG) signals has long been a challenging research area. Recently, with the growing application of deep learning methods such as convolutional neural networks (CNNs) and the channel attention (CA) mechanism, the accuracy of emotion recognition methods has reached an outstanding level. However, CNNs and their derivatives have a limited receptive field and can extract only local features. Moreover, the traditional channel attention mechanism focuses only on the correlations between channels, assigning each channel a weight according to its contribution to the emotion recognition task, while ignoring the fact that different EEG frequency bands within the same channel also contribute differently to the task. To address these problems, this paper proposes HA-CapsNet, a novel end-to-end model that combines 3DCNN-CapsNet with a hierarchical attention mechanism. The model captures both inter-channel correlations and the contribution of each frequency band, and the capsule network in 3DCNN-CapsNet extracts richer spatial feature information than conventional CNNs. HA-CapsNet achieves recognition accuracies of 97.40%, 97.20%, and 97.60% on the DEAP dataset and 95.80%, 96.10%, and 96.30% on the DREAMER dataset, outperforming state-of-the-art methods with the smallest variance. Furthermore, experiments that removed channels from the DEAP and DREAMER datasets in ascending order of their hierarchical attention weights showed that the model maintains strong recognition performance even with fewer channels. This demonstrates HA-CapsNet's low dependence on full channel sets and its suitability for lightweight EEG devices, promoting advancements in EEG device development.
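The two-level weighting described above (weights over frequency bands within each channel, then weights over channels) can be sketched in a minimal NumPy form. This is an illustrative assumption, not the paper's exact formulation: the feature shapes, the learnable score vectors `w_band` and `w_chan`, and the simple dot-product scoring are hypothetical placeholders for the model's actual attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(feats, w_band, w_chan):
    """Hierarchical attention sketch.

    feats:  (channels, bands, dim) EEG feature tensor
    w_band: (dim,) score vector for band-level attention (hypothetical)
    w_chan: (dim,) score vector for channel-level attention (hypothetical)
    """
    # Level 1: weight the frequency bands within each channel.
    band_scores = feats @ w_band                           # (channels, bands)
    band_w = softmax(band_scores, axis=1)                  # sums to 1 per channel
    chan_feats = (band_w[..., None] * feats).sum(axis=1)   # (channels, dim)
    # Level 2: weight the channels by their contribution.
    chan_scores = chan_feats @ w_chan                      # (channels,)
    chan_w = softmax(chan_scores, axis=0)                  # sums to 1 overall
    fused = (chan_w[:, None] * chan_feats).sum(axis=0)     # (dim,)
    return fused, band_w, chan_w

rng = np.random.default_rng(0)
C, B, D = 32, 4, 8  # e.g. 32 DEAP channels, 4 bands (theta/alpha/beta/gamma)
feats = rng.standard_normal((C, B, D))
fused, band_w, chan_w = hierarchical_attention(
    feats, rng.standard_normal(D), rng.standard_normal(D))
```

The learned channel weights `chan_w` also give the ranking used in the channel-removal experiments: channels can be dropped in ascending order of their attention weight.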