Evaluation of optical flow field features for the detection of word prominence in a human-machine interaction scenario

2015 International Joint Conference on Neural Networks (IJCNN), Pub Date: 2015-07-12, DOI: 10.1109/IJCNN.2015.7280639

Andrea Schnall, M. Heckmann

Citations: 2

Evaluation of optical flow field features for the detection of word prominence in a human-machine interaction scenario

In this paper we investigate optical flow field features for the automatic labeling of word prominence. Visual motion is a rich source of information: modifying the articulatory parameters to raise the prominence of a segment of an utterance is usually accompanied by stronger movement of the mouth and head than in a non-prominent segment. One way to describe such motion is to use optical flow fields. During the recording of the audio-visual database used for the experiments, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only, which created a narrow and a broad focus. The audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, duration, loudness, fundamental frequency and spectral emphasis were calculated. From the visual channel, the nose position is detected and the mouth region is extracted. The optical flow is calculated from this region, and all the optical flow fields for one word are summed up. The pooled optical flow for the four directions is then used as the feature vector. We demonstrate that using these features in addition to the audio features can improve the classification results for some speakers. We also compare the optical flow field features to other visual features, namely the nose position and image-transformation-based visual features. The optical flow field features carry less information than the image-transformation-based visual features, but using both in addition to the audio features leads to the overall best results, which shows that they contain complementary information.
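
The visual feature described in the abstract (sum the flow fields of one word, then pool over four directions) can be illustrated with a minimal sketch. This is only one plausible reading, not the authors' implementation: it assumes each flow field is a grid of per-pixel `(dx, dy)` displacements for the mouth region, that "four directions" means right, left, down and up, and that pooling means adding up the positive and negative parts of each component separately. The names `sum_flow_fields`, `pool_four_directions` and `word_flow_features` are hypothetical, and the flow fields themselves are taken as given rather than computed from video.

```python
from typing import List, Sequence, Tuple

# Assumed representation: one flow field is a grid (rows of pixels), and each
# pixel holds a (dx, dy) displacement. Image coordinates are assumed, so
# positive dy points downward.
FlowField = Sequence[Sequence[Tuple[float, float]]]


def sum_flow_fields(fields: Sequence[FlowField]) -> List[List[Tuple[float, float]]]:
    """Add up the per-pixel flow of all frames that belong to one word."""
    if not fields:
        raise ValueError("need at least one flow field")
    height, width = len(fields[0]), len(fields[0][0])
    total = [[(0.0, 0.0)] * width for _ in range(height)]
    for field in fields:
        for y, row in enumerate(field):
            for x, (dx, dy) in enumerate(row):
                sx, sy = total[y][x]
                total[y][x] = (sx + dx, sy + dy)
    return total


def pool_four_directions(field: FlowField) -> List[float]:
    """Pool a flow field into [right, left, down, up] motion totals.

    Assumption: each component is split into its positive and negative part,
    and the magnitudes are summed over the whole region.
    """
    right = left = down = up = 0.0
    for row in field:
        for dx, dy in row:
            if dx > 0:
                right += dx
            else:
                left += -dx
            if dy > 0:
                down += dy
            else:
                up += -dy
    return [right, left, down, up]


def word_flow_features(fields: Sequence[FlowField]) -> List[float]:
    """Four-dimensional feature vector for one word: sum first, then pool."""
    return pool_four_directions(sum_flow_fields(fields))


# Two tiny 1x2 flow fields standing in for the frames of one word.
frames = [
    [[(1.0, -2.0), (0.5, 0.0)]],
    [[(-3.0, 1.0), (0.5, 2.0)]],
]
print(word_flow_features(frames))  # [1.0, 2.0, 2.0, 1.0]
```

Summing before pooling, as the abstract's wording suggests, lets opposing motion at the same pixel cancel out across frames; pooling each frame first and summing afterwards would instead accumulate all motion regardless of direction changes.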
