During conversation, speakers adjust their linguistic behavior to become more similar to their partners. This complex phenomenon, known as entrainment, is dynamic: speakers both entrain and disentrain on different linguistic features over the course of an interaction. Researchers have applied a range of computational methods to study entrainment, and recent technological advances have made deep learning a viable tool for systematically quantifying acoustic entrainment dynamics. In this study, we investigate the ability of deep learning architectures to extract and leverage textual features for efficiently representing and learning entrainment. By adapting the architecture of an acoustic-based DNN entrainment model, we present an unsupervised deep learning framework that derives, from textual features, representations carrying information relevant to identifying entrainment at three linguistic levels: lexical, syntactic, and semantic. To assess each model within the proposed framework, we extracted a variety of text-based and speech features and quantified entrainment using different distance measures in the learned representation space. The trained models were evaluated on their ability to distinguish real from sham conversations using the proposed distances. Our results suggest that acoustic-based DNN models outperform text-based DNN models, and that the choice of distance measure affects model performance.
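The evaluation scheme described above (distance measures in a representation space, validated by separating real from sham conversations) can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings, the cosine metric, and the sham construction (pairing a speaker with unrelated turns) are illustrative assumptions.

```python
import numpy as np

def mean_turn_distance(a_turns, b_turns, metric="cosine"):
    """Average distance between each turn representation of speaker A
    and the partner's immediately following turn representation."""
    dists = []
    for a, b in zip(a_turns, b_turns):
        if metric == "cosine":
            d = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        else:  # fall back to Euclidean distance
            d = np.linalg.norm(a - b)
        dists.append(d)
    return float(np.mean(dists))

rng = np.random.default_rng(0)
base = rng.normal(size=(20, 16))                 # shared conversational "state"
a_turns = base + 0.05 * rng.normal(size=(20, 16))  # speaker A: entrained to state
b_real = base + 0.05 * rng.normal(size=(20, 16))   # real partner: also entrained
b_sham = rng.normal(size=(20, 16))                 # sham partner: unrelated turns

real_d = mean_turn_distance(a_turns, b_real)
sham_d = mean_turn_distance(a_turns, b_sham)
# In entrained (real) conversations the representation distance is smaller,
# which is the signal the evaluation exploits.
assert real_d < sham_d
```

Swapping the `metric` argument is the analogue of the abstract's observation that the choice of distance measure affects performance: different metrics can rank the same real/sham pairs differently.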
