Addressee Detection Using Facial and Audio Features in Mixed Human–Human and Human–Robot Settings: A Deep Learning Framework
Fiseha B. Tesema, J. Gu, Wei Song, Hong-Chuan Wu, Shiqiang Zhu, Zheyuan Lin, Min Huang, Wen Wang, R. Kumar
IEEE Systems Man and Cybernetics Magazine, volume 13 1, pages 25-38, published 2023-04-01
DOI: 10.1109/MSMC.2022.3224843 (https://doi.org/10.1109/MSMC.2022.3224843)
Abstract
Addressee detection (AD) enables a robot to interact smoothly with humans by determining whether it is being addressed. However, AD has not been widely explored. The few existing studies focused on human-to-human or human-to-robot conversations confined to a meeting room and relied on gaze and utterance cues. These works used statistical and rule-based approaches, which tend to depend on specific settings. Further, they did not fully leverage the available audio and visual information or the short-term and long-term segments, and they did not explore combining important conversational cues, namely facial and audio features. In addition, no annotated audiovisual spatiotemporal dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area with new approaches.