Mimicker-in-the-Browser: A Novel Interaction Using Mimicry to Augment the Browsing Experience
Riku Arakawa, Hiromu Yakura
Humans are known to form a more favorable subconscious impression of others who imitate their movements during social interactions. Despite this influential phenomenon, its application in human-computer interaction is currently limited to specific areas, such as an agent mimicking the head movements of a user in virtual reality, because capturing user movements conventionally requires external sensors. Implementing the mimicry effect on a scalable platform without such sensors would open a new approach to designing human-computer interaction. We therefore investigated whether users feel positively toward a mimicking agent delivered by a standalone web application using only a webcam. We also examined whether a web page that changes its background pattern based on the user's head movements can foster a favorable impression. The positive effect confirmed in our experiments supports mimicry as a novel design practice for augmenting everyday browsing experiences.
Proceedings of the 2020 International Conference on Multimodal Interaction. DOI: https://doi.org/10.1145/3382507.3418811
Human-centered Multimodal Machine Intelligence
Shrikanth S. Narayanan
Multimodal machine intelligence offers enormous possibilities for helping understand the human condition and for creating technologies to support and enhance human experiences [1, 2]. What makes such approaches and systems exciting is the promise they hold for adaptation and personalization in the presence of the rich and vast inherent heterogeneity, variety, and diversity within and across people. Multimodal engineering approaches can help analyze human traits (e.g., age), states (e.g., emotion), and behavior dynamics (e.g., interaction synchrony) objectively and at scale. Machine intelligence could also help detect and analyze deviations in patterns from what is deemed typical. These techniques in turn can assist, facilitate, or enhance decision making by humans and by autonomous systems. Realizing such a promise requires addressing two major, often intertwined, lines of challenges: creating inclusive technologies that work for everyone while enabling tools that can illuminate the source of variability or difference of interest. This talk highlights some of these possibilities and opportunities through examples drawn from two specific domains. The first relates to advancing health informatics in behavioral and mental health [3, 4]. With over 10% of the world's population affected, and with clinical research and practice heavily dependent on (relatively scarce) human expertise in diagnosing, managing, and treating these conditions, the engineering opportunities in offering access and tools to support care at scale are immense. For example, in determining whether a child is on the autism spectrum, a clinician would engage and observe the child in a series of interactive activities targeting relevant cognitive, communicative, and socio-emotional aspects, and codify specific patterns of interest (e.g., typicality of vocal intonation, facial expressions, joint attention behavior). Machine-intelligence-driven processing of speech, language, visual, and physiological data, combined with other forms of clinical data, enables novel and objective ways of supporting and scaling up these diagnostics. Likewise, multimodal systems can automate the analysis of a psychotherapy session, including computing treatment quality-assurance measures such as rating a therapist's expressed empathy. These technology possibilities can go beyond the traditional realm of clinics, directly to patients in their natural settings. For example, remote multimodal sensing of biobehavioral cues can enable new ways of screening for and tracking behaviors (e.g., stress in the workplace) and progress in treatment (e.g., for depression), and can offer just-in-time support. The second example is drawn from the world of media. Media are created by humans and for humans to tell stories. They cover an amazing range of domains, from the arts and entertainment to news, education, and commerce, and in staggering volume. Machine intelligence tools can help analyze media and measure their impact on individuals and society.
Proceedings of the 2020 International Conference on Multimodal Interaction. DOI: https://doi.org/10.1145/3382507.3417974
Effect of Modality on Human and Machine Scoring of Presentation Videos
Haley Lepp, C. W. Leong, K. Roohr, Michelle P. Martín‐Raugh, Vikram Ramanarayanan
We investigate the effect of observed data modality on human and machine scoring of informative presentations in the context of oral English communication training and assessment. Three sets of raters scored the content of three-minute presentations by college students on the basis of either the video, the audio, or the text transcript, using a custom scoring rubric. We find significant differences between the scores assigned when raters view a transcript or listen to audio recordings compared with watching a video of the same presentation, and we present an analysis of those differences. Using the human scores, we train machine learning models to score a given presentation using text, audio, and video features separately. We analyze the distribution of machine scores against the modality and label bias we observe in human scores, discuss the implications for machine scoring, and recommend best practices for future work in this direction. Our results demonstrate the importance of checking and correcting for bias across different modalities in evaluations of multimodal performances.
{"title":"Effect of Modality on Human and Machine Scoring of Presentation Videos","authors":"Haley Lepp, C. W. Leong, K. Roohr, Michelle P. Martín‐Raugh, Vikram Ramanarayanan","doi":"10.1145/3382507.3418880","DOIUrl":"https://doi.org/10.1145/3382507.3418880","url":null,"abstract":"We investigate the effect of observed data modality on human and machine scoring of informative presentations in the context of oral English communication training and assessment. Three sets of raters scored the content of three minute presentations by college students on the basis of either the video, the audio or the text transcript using a custom scoring rubric. We find significant differences between the scores assigned when raters view a transcript or listen to audio recordings in comparison to watching a video of the same presentation, and present an analysis of those differences. Using the human scores, we train machine learning models to score a given presentation using text, audio, and video features separately. We analyze the distribution of machine scores against the modality and label bias we observe in human scores, discuss its implications for machine scoring and recommend best practices for future work in this direction. Our results demonstrate the importance of checking and correcting for bias across different modalities in evaluations of multi-modal performances.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121668554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Video Affect via Induced Affection in the Wild
Yi Ding, Radha Kumaran, Tianjiao Yang, Tobias Höllerer
Curating large and high-quality datasets for studying affect is a costly and time-consuming process, especially when the labels are continuous. In this paper, we examine the potential of using unlabeled public reactions in the form of textual comments to aid in classifying video affect. We examine two popular datasets used for affect recognition and mine public reactions for these videos. We learn a representation of these reactions by using the video ratings as a weakly supervised signal. We show that our model can learn a fine-grained prediction of comment affect when given a video alone. Furthermore, we demonstrate how predicting the affective properties of a comment can be a potentially useful modality for multimodal affect modeling.
{"title":"Predicting Video Affect via Induced Affection in the Wild","authors":"Yi Ding, Radha Kumaran, Tianjiao Yang, Tobias Höllerer","doi":"10.1145/3382507.3418838","DOIUrl":"https://doi.org/10.1145/3382507.3418838","url":null,"abstract":"Curating large and high quality datasets for studying affect is a costly and time consuming process, especially when the labels are continuous. In this paper, we examine the potential to use unlabeled public reactions in the form of textual comments to aid in classifying video affect. We examine two popular datasets used for affect recognition and mine public reactions for these videos. We learn a representation of these reactions by using the video ratings as a weakly supervised signal. We show that our model can learn a fine-graind prediction of comment affect when given a video alone. Furthermore, we demonstrate how predicting the affective properties of a comment can be a potentially useful modality to use in multimodal affect modeling.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117039026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal Attention and Consistency Measuring for Video Question Answering
Lingyu Zhang, R. Radke
Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state-of-the-art accuracy on the Social-IQ dataset, which contains hundreds of videos and question/answer pairs.
Proceedings of the 2020 International Conference on Multimodal Interaction. DOI: https://doi.org/10.1145/3382507.3418886
Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations
Oswald Barral, Sébastien Lallé, Grigorii Guz, A. Iranpour, C. Conati
We leverage eye-tracking data to predict user performance and levels of cognitive abilities while reading magazine-style narrative visualizations (MSNVs), a widespread form of multimodal document that combines text and visualizations. Such predictions are motivated by recent interest in devising user-adaptive MSNVs that can dynamically adapt to a user's needs. Our results provide evidence for the feasibility of real-time user modeling in MSNVs, as we are the first to consider eye-tracking data for predicting task comprehension and cognitive abilities while processing multimodal documents. We close with a discussion of the implications for the design of personalized MSNVs.
{"title":"Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations","authors":"Oswald Barral, Sébastien Lallé, Grigorii Guz, A. Iranpour, C. Conati","doi":"10.1145/3382507.3418884","DOIUrl":"https://doi.org/10.1145/3382507.3418884","url":null,"abstract":"We leverage eye-tracking data to predict user performance and levels of cognitive abilities while reading magazine-style narrative visualizations (MSNV), a widespread form of multimodal documents that combine text and visualizations. Such predictions are motivated by recent interest in devising user-adaptive MSNVs that can dynamically adapt to a user's needs. Our results provide evidence for the feasibility of real-time user modeling in MSNV, as we are the first to consider eye tracking data for predicting task comprehension and cognitive abilities while processing multimodal documents. We follow with a discussion on the implications to the design of personalized MSNVs.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121492281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MSP-Face Corpus: A Natural Audiovisual Emotional Database
Andrea Vidal, Ali N. Salman, Wei-Cheng Lin, C. Busso
Expressive behaviors conveyed during daily interactions are difficult to determine because they often consist of a blend of different emotions. The complexity of expressive human communication is an important challenge for building and evaluating automatic systems that can reliably predict emotions. Emotion recognition systems are often trained with limited databases, where the emotions are either elicited or portrayed by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied in practical applications. Developing rich emotional databases that reflect the complexity of the externalization of emotion is an important step toward building better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, in which multiple individuals discuss various topics, expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with alternative data collection protocols. A feature of the corpus is its two sets. The first set includes videos that have been annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings; 24 hrs 41 min). The second set includes similar videos without emotional labels (17,955 recordings; 45 hrs 57 min), offering an ideal infrastructure for exploring semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines on this new database using unimodal (audio, video) and multimodal emotion recognition systems.
{"title":"MSP-Face Corpus: A Natural Audiovisual Emotional Database","authors":"Andrea Vidal, Ali N. Salman, Wei-Cheng Lin, C. Busso","doi":"10.1145/3382507.3418872","DOIUrl":"https://doi.org/10.1145/3382507.3418872","url":null,"abstract":"Expressive behaviors conveyed during daily interactions are difficult to determine, because they often consist of a blend of different emotions. The complexity in expressive human communication is an important challenge to build and evaluate automatic systems that can reliably predict emotions. Emotion recognition systems are often trained with limited databases, where the emotions are either elicited or recorded by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied to practical applications. Developing rich emotional databases that reflect the complexity in the externalization of emotion is an important step to build better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, where multiple individuals discuss various topics expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with other alternative data collection protocols. A feature of the corpus is the addition of two sets. The first set includes videos that have been annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings -- 24 hrs, 41 m). The second set includes similar videos without emotional labels (17,955 recordings -- 45 hrs, 57 m), offering the perfect infrastructure to explore semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines over this new database using unimodal (audio, video) and multimodal emotional recognition systems.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121624723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional and Non-conventional Job Interviewing Methods: A Comparative Study in Two Countries
K. Shubham, E. Kleinlogel, Anaïs Butera, M. S. Mast, D. Jayagopi
With recent advancements in technology, new platforms have emerged to substitute for face-to-face interviews. Of particular interest are asynchronous video interviewing (AVI) platforms, where candidates talk to a screen displaying questions, and virtual-agent-based interviewing platforms, where a human-like avatar interviews candidates. These anytime-anywhere interviewing systems scale up the overall reach of the interviewing process for firms, though they may not provide the best experience for the candidates. An important research question is how candidates perceive such platforms and what impact the platforms have on their performance and behavior. Also, is there an advantage of one setting over another, i.e., Avatar vs. Platform? Finally, are such differences consistent across cultures? In this paper, we present the results of a comparative study conducted in three different interview settings (Face-to-face, Avatar, and Platform) and two different cultural contexts (India and Switzerland), and we analyze the differences in self-rated performance, other-rated performance, and automatic audiovisual behavioral cues.
{"title":"Conventional and Non-conventional Job Interviewing Methods: A Comparative Study in Two Countries","authors":"K. Shubham, E. Kleinlogel, Anaïs Butera, M. S. Mast, D. Jayagopi","doi":"10.1145/3382507.3418824","DOIUrl":"https://doi.org/10.1145/3382507.3418824","url":null,"abstract":"With recent advancements in technology, new platforms have come up to substitute face-to-face interviews. Of particular interest are asynchronous video interviewing (AVI) platforms, where candidates talk to a screen with questions, and virtual agent based interviewing platforms, where a human-like avatar interviews candidates. These anytime-anywhere interviewing systems scale up the overall reach of the interviewing process for firms, though they may not provide the best experience for the candidates. An important research question is how the candidates perceive such platforms and its impact on their performance and behavior. Also, is there an advantage of one setting vs. another i.e., Avatar vs. Platform? Finally, would such differences be consistent across cultures? In this paper, we present the results of a comparative study conducted in three different interview settings (i.e., Face-to-face, Avatar, and Platform), as well as two different cultural contexts (i.e., India and Switzerland), and analyze the differences in self-rated, others-rated performance, and automatic audiovisual behavioral cues.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123838026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BreathEasy: Assessing Respiratory Diseases Using Mobile Multimodal Sensors
Md. Mahbubur Rahman, M. Y. Ahmed, Tousif Ahmed, Bashima Islam, Viswam Nathan, K. Vatanparvar, Ebrahim Nemati, Daniel McCaffrey, Jilong Kuang, J. Gao
Mobile respiratory assessments using commodity smartphones and smartwatches are an unmet need for patient monitoring at home. In this paper, we show the feasibility of using multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort respiratory assessment. We conducted studies with 228 chronic respiratory patients and healthy subjects, and we show that our model can estimate respiratory rate with a mean absolute error (MAE) of 0.72±0.62 breaths per minute and differentiate respiratory patients from healthy subjects with 90% recall and 76% precision when the user breathes normally while holding the device on the chest or the abdomen for a minute. Holding the device on the chest or abdomen requires significantly less effort than traditional spirometry, which requires a specialized device and forceful, vigorous breathing. This paper shows the feasibility of a low-effort respiratory assessment that can be made available anywhere, anytime through users' own mobile devices.
{"title":"BreathEasy: Assessing Respiratory Diseases Using Mobile Multimodal Sensors","authors":"Md. Mahbubur Rahman, M. Y. Ahmed, Tousif Ahmed, Bashima Islam, Viswam Nathan, K. Vatanparvar, Ebrahim Nemati, Daniel McCaffrey, Jilong Kuang, J. Gao","doi":"10.1145/3382507.3418852","DOIUrl":"https://doi.org/10.1145/3382507.3418852","url":null,"abstract":"Mobil respiratory assessments using commodity smartphones and smartwatches are unmet needs for patient monitoring at home. In this paper, we show the feasibility of using multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort respiratory assessment. We have conducted studies with 228 chronic respiratory patients and healthy subjects, and show that our model can estimate respiratory rate with mean absolute error (MAE) 0.72$pm$0.62 breath per minute and differentiate respiratory patients from healthy subjects with 90% recall and 76% precision when the user breathes normally by holding the device on the chest or the abdomen for a minute. Holding the device on the chest or abdomen needs significantly lower effort compared to traditional spirometry which requires a specialized device and forceful vigorous breathing. This paper shows the feasibility of developing a low-effort respiratory assessment towards making it available anywhere, anytime through users' own mobile devices.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128290115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild
Lukas Stappen, Georgios Rizos, Björn Schuller
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity in the appearance of drivers' faces, both in terms of occlusions (e.g., vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information, for example the vehicle cabin environment, adds another, less subjective signal toward the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. To showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild (EmotiW) Challenge. Our best model outperforms the baseline by an absolute 15.03% in accuracy on the validation set and improves the previously best reported result by an absolute 8.72% on the test set.
Proceedings of the 2020 International Conference on Multimodal Interaction. DOI: https://doi.org/10.1145/3382507.3417967