Purpose: To develop an effective visual AI recognition model for epileptic spasms (ES), and to make ES detection more reliable and convenient.
Method: We collected approximately 330 hours of motion videos of infants with epileptic spasms at the Children's Hospital, Zhejiang University School of Medicine, from November 2022 to October 2024. A video-centered AI model was constructed, with a Vision Transformer (ViT) pre-trained via Contrastive Language-Image Pre-training (CLIP) serving as the core spatial feature extractor. A temporal convolution module was integrated to capture temporal information, and a multi-layer perceptron performed the normal-versus-abnormal binary classification. Focal loss was applied to mitigate class imbalance by prioritizing hard-to-classify samples. The model was trained for 100 epochs under 5 random dataset splits, each partitioned by individual infant to ensure disjoint training and test sets. Model performance was evaluated on the test sets using Precision, Recall, F-score, Accuracy, and AUROC.
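The pipeline described above (per-frame spatial features from a CLIP-pretrained ViT, a temporal convolution over the frame sequence, and an MLP binary head trained with focal loss) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the feature dimension, hidden width, kernel size, temporal pooling, and the focal-loss hyperparameters (alpha, gamma) are assumptions not stated in the abstract.

```python
import torch
import torch.nn as nn


class SpasmClassifier(nn.Module):
    """Sketch of the described model: precomputed per-frame features
    (a stand-in for the CLIP ViT encoder) -> 1-D temporal convolution
    -> mean pooling over time -> MLP head with 2 output logits."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # Temporal convolution mixes information across neighboring frames.
        self.temporal = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        # Minimal MLP head for the normal-vs-abnormal decision.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) frame embeddings.
        x = self.temporal(feats.transpose(1, 2))  # (batch, hidden, frames)
        x = x.mean(dim=2)                         # pool over the time axis
        return self.head(x)                       # (batch, 2) class logits


def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss; alpha/gamma are the common defaults, and the
    values actually used in the paper are not given in the abstract.
    p: predicted probability of the abnormal class; y: labels in {0, 1}."""
    p_t = p * y + (1.0 - p) * (1.0 - y)           # prob. of the true class
    alpha_t = alpha * y + (1.0 - alpha) * (1.0 - y)
    # (1 - p_t)^gamma down-weights well-classified (easy) samples.
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```

Because easy samples are down-weighted by the (1 - p_t)^gamma factor, a confidently correct prediction contributes far less loss than a misclassified one, which is the imbalance-mitigation behavior the abstract attributes to focal loss.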
Results: The median age of ES onset was 0.4 years (interquartile range, 0.3 to 0.7). All patients exhibited isolated or clustered epileptic spasms. Using the CLIP-based classifier, the system achieved a recall of 1.00 ± 0.00, a precision of 0.78 ± 0.01, an F-score of 0.87 ± 0.01, an accuracy of 0.98 ± 0.01, and an AUROC of 0.99 ± 0.01 in detecting epileptic spasms, outperforming previously reported methods. Case studies of four infants' motion videos showed a high degree of consistency between the model's predictions and expert annotations. The model effectively distinguished ES episodes from normal movement patterns, even in videos containing multiple intermittent ES segments.
Conclusion: We developed an automatic motion recognition model that holds significant potential for early automated detection of ES.

