{"title":"Speaker Identity Recognition by Acoustic and Visual Data Fusion through Personal Privacy for Smart Care and Service Applications","authors":"I. Ding, C.-M. Ruan","doi":"10.2352/j.imagingsci.technol.2020.64.4.040404","DOIUrl":null,"url":null,"abstract":"Abstract With rapid developments in techniques related to the internet of things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware-based emotion recognition will gain much attention and potentially\n be a requirement in smart home or office environments. In such intelligence applications, identity recognition of the specific member in indoor spaces will be a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. In this approach, visual information\n obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees to significantly enhance the Gaussian mixture model (GMM)-based speaker recognition method. This study considered the privacy of the monitored person and\n reduced the degree of surveillance. Moreover, the popular Kinect sensor device containing a microphone array was adopted to obtain acoustic voice data from the person. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space for conveniently\n performing face detection and quickly determining the total number of people in the specific space. Such information pertaining to the number of people in the indoor space obtained using face detection was utilized to effectively regulate the accurate GMM speaker classification tree design.\n Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study—the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve\n excellent identity recognition rates of 84.28% and 83%, respectively; both values are higher than the rate of the conventional GMM approach (80.5%). Moreover, as the extremely complex calculations of face recognition in general audio-visual speaker recognition tasks are not required, the proposed\n approach is rapid and efficient with only a slight increment of 0.051 s in the average recognition time.","PeriodicalId":15924,"journal":{"name":"Journal of Imaging Science and Technology","volume":"64 1","pages":"40404-1-40404-16"},"PeriodicalIF":0.6000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.2352/j.imagingsci.technol.2020.64.4.040404","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Abstract With rapid developments in techniques related to the internet of things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware-based emotion recognition will gain much attention and potentially
be a requirement in smart home or office environments. In such intelligence applications, identity recognition of the specific member in indoor spaces will be a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. In this approach, visual information
obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees to significantly enhance the Gaussian mixture model (GMM)-based speaker recognition method. This study considered the privacy of the monitored person and
reduced the degree of surveillance. Moreover, the popular Kinect sensor device containing a microphone array was adopted to obtain acoustic voice data from the person. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space for conveniently
performing face detection and quickly determining the total number of people in the specific space. Such information pertaining to the number of people in the indoor space obtained using face detection was utilized to effectively regulate the accurate GMM speaker classification tree design.
Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study—the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve
excellent identity recognition rates of 84.28% and 83%, respectively; both values are higher than the rate of the conventional GMM approach (80.5%). Moreover, as the extremely complex calculations of face recognition in general audio-visual speaker recognition tasks are not required, the proposed
approach is rapid and efficient with only a slight increment of 0.051 s in the average recognition time.
期刊介绍:
Typical issues include research papers and/or comprehensive reviews from a variety of topical areas. In the spirit of fostering constructive scientific dialog, the Journal accepts Letters to the Editor commenting on previously published articles. Periodically the Journal features a Special Section containing a group of related— usually invited—papers introduced by a Guest Editor. Imaging research topics that have coverage in JIST include:
Digital fabrication and biofabrication;
Digital printing technologies;
3D imaging: capture, display, and print;
Augmented and virtual reality systems;
Mobile imaging;
Computational and digital photography;
Machine vision and learning;
Data visualization and analysis;
Image and video quality evaluation;
Color image science;
Image archiving, permanence, and security;
Imaging applications including astronomy, medicine, sports, and autonomous vehicles.