Estimating the severity of dysarthria, a speech disorder caused by neurological conditions, is clinically important: it supports diagnosis, early detection, and personalized treatment. Significant progress has been made in leveraging self-supervised learning (SSL) models as feature extractors for various classification tasks, demonstrating their effectiveness. Building on this, this paper examines whether all features extracted from SSL models are necessary for optimal dysarthria severity classification from speech. We focused on layer-wise feature analysis of one base model, Wav2Vec2-base, and four large models, Wav2Vec2-large, HuBERT-large, Data2Vec-large, and WavLM-large, using a Convolutional Neural Network (CNN) classifier with mel-frequency cepstral coefficient (MFCC) features as a baseline. Experiments showed that the later transformer layers of the SSL models were more effective for dysarthria severity classification than the earlier layers, as the later layers better capture articulation and complex temporal patterns refined from the middle layers. More specifically, the analysis revealed that embeddings from transformer encoder layer 23 of HuBERT-large yielded the best performance among all five models, possibly due to HuBERT's hierarchical learning from unsupervised clustering. To further assess whether all feature dimensions are important, we examined the impact of varying the feature dimensionality. Our findings indicated that reducing the dimensionality from 1024 to 32 led to further improvements in accuracy, indicating that not all features are necessary for effective severity classification. Additionally, feature fusion was conducted by combining the optimal reduced dimensions from the best-performing layer with varying numbers of MFCC feature dimensions, resulting in further performance improvements. The highest accuracy of 70.44% was achieved by combining 32 selected dimensions from the HuBERT-large model with 21 MFCC feature dimensions.
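The layer selection and dimensionality reduction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for real HuBERT-large hidden states (which would come from, e.g., a model run with hidden-state outputs enabled), and an SVD-based PCA stands in for whatever reduction method the authors used.

```python
import numpy as np

# Stand-in for HuBERT-large hidden states: 24 transformer layers plus the
# CNN feature-encoder output at index 0, hidden size 1024, T frames.
rng = np.random.default_rng(0)
num_frames, hidden = 50, 1024
hidden_states = rng.standard_normal((25, num_frames, hidden))

# Select transformer encoder layer 23 and mean-pool over time
# to obtain a single utterance-level embedding.
layer23 = hidden_states[23]            # (T, 1024)
utt_embedding = layer23.mean(axis=0)   # (1024,)

def pca_reduce(X, k):
    """Reduce X (n_samples, n_features) to k dimensions via SVD (a simple PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# A batch of utterance embeddings reduced from 1024 to 32 dimensions,
# mirroring the dimensionality found to improve accuracy.
batch = rng.standard_normal((100, hidden))
reduced = pca_reduce(batch, 32)
print(reduced.shape)  # (100, 32)
```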
The fusion of HuBERT-large (32) and MFCC (21) features outperformed the HuBERT-large baseline by 6.36% and the MFCC baseline by 15.28% in absolute terms. Furthermore, combining the fused features with handcrafted features from the articulatory, prosodic, phonatory, and respiratory domains increased the classification accuracy to 73.53%, yielding a more robust representation for dysarthria severity classification. Probing analyses of articulatory and prosodic features supported the choice of the best-performing HuBERT layer, while the low correlation with the handcrafted features highlighted their complementary contribution. Finally, comparative t-SNE visualizations further validated the effectiveness of the proposed feature fusion, demonstrating clearer class separability.
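The fusion step above can be illustrated with a simple early-fusion (concatenation) sketch. The arrays below are placeholders for the actual extracted features; whether the paper fuses by concatenation or another scheme is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_utts = 100

# Placeholder per-utterance features: 32 reduced HuBERT-large dimensions
# and 21 MFCC dimensions, as in the best-performing configuration.
ssl_feats = rng.standard_normal((n_utts, 32))
mfcc_feats = rng.standard_normal((n_utts, 21))

# Early fusion by concatenation: a 53-dimensional vector per utterance,
# which would then be fed to the CNN classifier.
fused = np.concatenate([ssl_feats, mfcc_feats], axis=1)
print(fused.shape)  # (100, 53)
```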
