Spatio-temporal Weber Gradient Directional feature for visual and audio-visual phrase recognition systems

International Journal of Information Technology. Pub Date: 2024-08-12. DOI: 10.1007/s41870-024-02138-9

Salam Nandakishor, Debadatta Pati


Citations: 0



Spatio-temporal Weber Gradient Directional feature for visual and audio-visual phrase recognition systems

Visual phrase recognition needs visual features related to lip movement, while audio-visual phrase recognition requires both acoustic and visual features. In this work, we propose a novel visual feature, the Spatio-temporal Weber Gradient Directional (SWGD) feature, to effectively represent the micro-patterns of lip movements. The proposed feature is built from three kinds of micro-texture information: local differential excitation, gradient orientation, and gradient directional information. Experiments are conducted on the standard OuluVS database. A polynomial-kernel support vector machine (SVM) classifier is employed, as it provides relatively better performance. SWGD extracted with a \(2\times 5\times 3\) video block size achieves a higher performance of 73.9%. Additionally, we explore twelve distinct local descriptors commonly employed in face recognition and use them for the first time in a comparative study of phrase recognition. SWGD performs better than these twelve features but has a higher dimension of 4320. Reducing the dimension to 100 with the soft locality preserving map (SLPM) improves performance from 73.9% to 81.3%. The dimensionally reduced SWGD (SWGD\(_{\text{SLPM}}\)) outperforms the other state-of-the-art visual features considered in this paper. This shows the benefit of the salient micro-texture information that the proposed feature captures but state-of-the-art features neglect. We observe that the SWGD\(_{\text{SLPM}}\) feature has a high discriminative ability to represent the distinct lip movement patterns of different phrases. The performance of a Mel-frequency cepstral coefficient (MFCC) based audio phrase recognizer degrades as the signal-to-noise ratio decreases. Including the SWGD\(_{\text{SLPM}}\) visual feature and the Glottal MFCC (GMFCC) excitation source feature improves performance by 3.6%, reflecting noise robustness.
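The abstract names local differential excitation and gradient orientation as ingredients of SWGD but does not give their formulas. As a rough illustration only, the sketch below computes these two quantities in the way the standard single-frame Weber Local Descriptor usually defines them. It is not the authors' spatio-temporal feature: the function name `weber_components`, the 3×3 neighbourhood, the `eps` guard against division by zero, and the omission of the gradient directional term and of any histogramming over video blocks are all assumptions made for this example.

```python
import math


def weber_components(img, eps=1e-6):
    """Return (excitation, orientation) maps for the interior pixels of a
    2-D grayscale image given as a list of equal-length rows of numbers.

    Illustrative single-frame sketch; not the SWGD feature itself.
    """
    rows, cols = len(img), len(img[0])
    excitation, orientation = [], []
    for r in range(1, rows - 1):
        exc_row, ori_row = [], []
        for c in range(1, cols - 1):
            center = img[r][c]
            # Sum of differences between the 8 neighbours and the centre pixel.
            diff_sum = sum(
                img[r + dr][c + dc] - center
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Weber's law: the change relative to the centre intensity,
            # compressed into (-pi/2, pi/2) by arctan.
            exc_row.append(math.atan(diff_sum / (center + eps)))
            # Gradient orientation from vertical and horizontal differences.
            vertical = img[r + 1][c] - img[r - 1][c]
            horizontal = img[r][c + 1] - img[r][c - 1]
            ori_row.append(math.atan2(vertical, horizontal))
        excitation.append(exc_row)
        orientation.append(ori_row)
    return excitation, orientation


if __name__ == "__main__":
    # A dark pixel surrounded by brighter neighbours gives a large excitation.
    patch = [[2, 2, 2], [2, 1, 2], [2, 2, 2]]
    exc, ori = weber_components(patch)
    print(exc[0][0], ori[0][0])
```

In a full descriptor, values like these would typically be quantised and accumulated into histograms per block before classification. How SWGD does this across the \(2\times 5\times 3\) video blocks is not stated in the abstract.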
