Emotional expressions in speech are often recognized rapidly, highlighting the central role of prosodic cues. However, less is known about how accurately listeners perceive emotion in AI-generated voices, particularly in relatively low-resource languages such as Korean. This study examined the recognition of four emotions (Happy, Sad, Angry, and Anxious) in human- and AI-generated Korean speech. Thirty-six Korean listeners judged both the voice type (human vs. AI) and the emotional content of 64 utterances (32 human, 32 AI-generated), with lexical content controlled and stimuli matched by speaker. Human voices yielded higher recognition accuracy and faster response times than AI voices. Among the emotions, Happy was recognized most accurately and Anxious least accurately, and the difficulty with Anxious was especially pronounced in AI speech. Random forest analyses further revealed differences in cue reliance: recognition of human Anxious speech drew on a combination of pitch and intensity cues, whereas recognition of AI Anxious speech was driven primarily by intensity variability, reflecting reliance on more salient, exaggerated acoustic features. These findings suggest that distinctions between human and AI-generated speech become most evident for complex emotions, as listeners are sensitive to subtle prosodic cues that human voices provide but synthetic voices often lack.
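To make the random-forest cue-reliance analysis concrete, the sketch below shows one way such an analysis could be set up in Python: fit a random forest that predicts listeners' emotion responses from per-utterance acoustic summaries, separately for human and AI voices, and compare feature importances. The file name, column names, and feature set are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal, hypothetical sketch of a random-forest cue-importance analysis.
# Assumed inputs: a trial-level table with acoustic summaries per utterance
# and the listener's emotion response; names are illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("trials.csv")  # assumed file; one row per trial
features = ["pitch_mean", "pitch_sd", "intensity_mean", "intensity_sd",
            "duration", "speech_rate"]  # assumed acoustic predictors

for voice_type in ["human", "ai"]:
    subset = df[df["voice_type"] == voice_type]
    X, y = subset[features], subset["emotion_response"]

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    acc = cross_val_score(rf, X, y, cv=5).mean()  # out-of-sample accuracy
    rf.fit(X, y)

    # Impurity-based importances indicate which cues the model leans on,
    # analogous to the pitch-vs-intensity contrast reported for Anxious speech.
    ranked = sorted(zip(features, rf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    print(f"{voice_type}: cross-validated accuracy = {acc:.2f}")
    for name, importance in ranked:
        print(f"  {name:15s} {importance:.3f}")
```

Comparing the ranked importances across the two voice types would then show, for example, whether pitch-related features carry weight only for human speech while intensity variability dominates for AI speech.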