Pub Date: 2024-11-01. Epub Date: 2024-11-04. DOI: 10.1145/3678957.3685711
Muhittin Gokmen, Evangelos Sariyanidi, Lisa Yankowitz, Casey J Zampella, Robert T Schultz, Birkan Tunç
Head movements play a crucial role in social interactions. The quantification of communicative movements such as nodding, shaking, orienting, and backchanneling is significant in behavioral and mental health research. However, automated localization of such head movements within videos remains challenging in computer vision due to their arbitrary start and end times, durations, and frequencies. In this work, we introduce a novel and efficient coding system for head movements, grounded in Birdwhistell's kinesics theory, to automatically identify basic head motion units such as nodding and shaking. Our approach first defines the smallest unit of head movement, termed kine, based on the anatomical constraints of the neck and head. We then quantify the location, magnitude, and duration of kines within each angular component of head movement. By combining the identified kines, we define a higher-level construct, the kineme, which corresponds to basic head motion units such as nodding and shaking. We validate the proposed framework by predicting autism spectrum disorder (ASD) diagnosis from video recordings of interacting partners. We show that the multi-scale property of the proposed framework provides a significant advantage, as collapsing behavior across temporal scales consistently reduces performance. Finally, we incorporate another fundamental behavioral modality, namely speech, and show that distinguishing between speaking- and listening-time head movements significantly improves ASD classification performance.
{"title":"Detecting Autism from Head Movements using Kinesics.","authors":"Muhittin Gokmen, Evangelos Sariyanidi, Lisa Yankowitz, Casey J Zampella, Robert T Schultz, Birkan Tunç","doi":"10.1145/3678957.3685711","DOIUrl":"https://doi.org/10.1145/3678957.3685711","url":null,"abstract":"<p><p>Head movements play a crucial role in social interactions. The quantification of communicative movements such as nodding, shaking, orienting, and backchanneling is significant in behavioral and mental health research. However, automated localization of such head movements within videos remains challenging in computer vision due to their arbitrary start and end times, durations, and frequencies. In this work, we introduce a novel and efficient coding system for head movements, grounded in Birdwhistell's kinesics theory, to automatically identify basic head motion units such as nodding and shaking. Our approach first defines the smallest unit of head movement, termed <i>kine</i>, based on the anatomical constraints of the neck and head. We then quantify the location, magnitude, and duration of <i>kines</i> within each angular component of head movement. Through defining possible combinations of identified <i>kines</i>, we define a higher-level construct, <i>kineme</i>, which corresponds to basic head motion units such as nodding and shaking. We validate the proposed framework by predicting autism spectrum disorder (ASD) diagnosis from video recordings of interacting partners. We show that the multi-scale property of the proposed framework provides a significant advantage, as collapsing behavior across temporal scales reduces performance consistently. Finally, we incorporate another fundamental behavioral modality, namely speech, and show that distinguishing between speaking- and listening-time head movementsments significantly improves ASD classification performance.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2024 ","pages":"350-354"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11542642/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142633793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandria K Vail, Jeffrey M Girard, Lauren M Bylsma, Jeffrey F Cohn, Jay Fournier, Holly A Swartz, Louis-Philippe Morency
The relationship between a therapist and their client is one of the most critical determinants of successful therapy. The working alliance is a multifaceted concept capturing the collaborative aspect of the therapist-client relationship; a strong working alliance has been extensively linked to many positive therapeutic outcomes. Although therapy sessions are decidedly multimodal interactions, the language modality is of particular interest given its recognized relationship to similar dyadic concepts such as rapport, cooperation, and affiliation. Specifically, in this work we study language entrainment, which measures how much the therapist and client adapt toward each other's use of language over time. Despite the growing body of work in this area, however, relatively few studies examine causal relationships between human behavior and these relationship metrics: does an individual's perception of their partner affect how they speak, or does how they speak affect their perception? We explore these questions in this work through the use of structural equation modeling (SEM) techniques, which allow for both multilevel and temporal modeling of the relationship between the quality of the therapist-client working alliance and the participants' language entrainment. In our first experiment, we demonstrate that these techniques perform well in comparison to other common machine learning models, with the added benefits of interpretability and causal analysis. In our second analysis, we interpret the learned models to examine the relationship between working alliance and language entrainment and address our exploratory research questions. The results reveal that a therapist's language entrainment can have a significant impact on the client's perception of the working alliance, and that the client's language entrainment is a strong indicator of their perception of the working alliance. We discuss the implications of these results and consider several directions for future work in multimodality.
{"title":"Toward Causal Understanding of Therapist-Client Relationships: A Study of Language Modality and Social Entrainment.","authors":"Alexandria K Vail, Jeffrey M Girard, Lauren M Bylsma, Jeffrey F Cohn, Jay Fournier, Holly A Swartz, Louis-Philippe Morency","doi":"10.1145/3536221.3556616","DOIUrl":"https://doi.org/10.1145/3536221.3556616","url":null,"abstract":"<p><p>The relationship between a therapist and their client is one of the most critical determinants of successful therapy. The <i>working alliance</i> is a multifaceted concept capturing the collaborative aspect of the therapist-client relationship; a strong working alliance has been extensively linked to many positive therapeutic outcomes. Although therapy sessions are decidedly multimodal interactions, the language modality is of particular interest given its recognized relationship to similar dyadic concepts such as rapport, cooperation, and affiliation. Specifically, in this work we study <i>language entrainment</i>, which measures how much the therapist and client adapt toward each other's use of language over time. Despite the growing body of work in this area, however, relatively few studies examine <i>causal</i> relationships between human behavior and these relationship metrics: does an individual's perception of their partner affect how they speak, or does how they speak affect their perception? We explore these questions in this work through the use of structural equation modeling (SEM) techniques, which allow for both multilevel and temporal modeling of the relationship between the quality of the therapist-client working alliance and the participants' language entrainment. In our first experiment, we demonstrate that these techniques perform well in comparison to other common machine learning models, with the added benefits of interpretability and causal analysis. In our second analysis, we interpret the learned models to examine the relationship between working alliance and language entrainment and address our exploratory research questions. The results reveal that a therapist's language entrainment can have a significant impact on the client's perception of the working alliance, and that the client's language entrainment is a strong indicator of their perception of the working alliance. We discuss the implications of these results and consider several directions for future work in multimodality.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2022 ","pages":"487-494"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9999472/pdf/nihms-1879155.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9110473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weichen Wang, Jialing Wu, Subigya Nepal, Alex daSilva, Elin Hedlund, Eilis Murphy, Courtney Rogers, Jeremy Huckins
Pandemics significantly impact human daily life. People throughout the world adhere to safety protocols (e.g., social distancing and self-quarantining). As a result, they willingly keep their distance from workplaces, friends, and even family. In such circumstances, in-person social interactions may be substituted with virtual ones via online channels such as Instagram and Snapchat. To gain insight into this phenomenon, we study a group of undergraduate students before and after the start of the COVID-19 pandemic. Specifically, we track N=102 undergraduate students on a small college campus prior to the pandemic using mobile sensing from phones and assign semantic labels to each location they visit on campus where they study, socialize, and live. By leveraging their colocation network at these various semantically labeled places on campus, we find that colocations at certain places that possibly proxy higher in-person social interaction (e.g., dormitories, gyms, and Greek houses) show significant predictive capability in identifying individuals' change in social media usage during the pandemic period. We show that we can predict students' change in social media usage during COVID-19 with an F1 score of 0.73 purely from the in-person colocation data generated prior to the pandemic.
{"title":"On the Transition of Social Interaction from In-Person to Online: Predicting Changes in Social Media Usage of College Students during the COVID-19 Pandemic based on Pre-COVID-19 On-Campus Colocation.","authors":"Weichen Wang, Jialing Wu, Subigya Nepal, Alex daSilva, Elin Hedlund, Eilis Murphy, Courtney Rogers, Jeremy Huckins","doi":"10.1145/3462244.3479888","DOIUrl":"https://doi.org/10.1145/3462244.3479888","url":null,"abstract":"<p><p>Pandemics significantly impact human daily life. People throughout the world adhere to safety protocols (e.g., social distancing and self-quarantining). As a result, they willingly keep distance from workplace, friends and even family. In such circumstances, in-person social interactions may be substituted with virtual ones via online channels, such as, Instagram and Snapchat. To get insights into this phenomenon, we study a group of undergraduate students before and after the start of COVID-19 pandemic. Specifically, we track N=102 undergraduate students on a small college campus prior to the pandemic using mobile sensing from phones and assign semantic labels to each location they visit on campus where they study, socialize and live. By leveraging their colocation network at these various semantically labeled places on campus, we find that colocations at certain places that possibly proxy higher in-person social interactions (e.g., dormitories, gyms and Greek houses) show significant predictive capability in identifying the individuals' change in social media usage during the pandemic period. We show that we can predict student's change in social media usage during COVID-19 with an F1 score of 0.73 purely from the in-person colocation data generated prior to the pandemic.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2021 ","pages":"425-434"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9747327/pdf/nihms-1855031.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10641903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Torsten Wörtwein, Lisa B Sheeber, Nicholas Allen, Jeffrey F Cohn, Louis-Philippe Morency
This paper studies the hypothesis that not all modalities are always needed to predict affective states. We explore this hypothesis in the context of recognizing three affective states that have shown a relation to a future onset of depression: positive, aggressive, and dysphoric. In particular, we investigate three modalities important for face-to-face conversations: the visual, language, and acoustic modalities. We first perform a human study to better understand which subsets of modalities people find informative when recognizing these three affective states. As a second contribution, we explore how these human annotations can guide automatic affect recognition systems to be more interpretable without degrading their predictive performance. Our studies show that humans can reliably annotate modality informativeness. Further, we observe that guided models significantly improve interpretability, i.e., they attend to modalities similarly to how humans rate modality informativeness, while at the same time showing a slight increase in predictive performance.
{"title":"Human-Guided Modality Informativeness for Affective States.","authors":"Torsten Wörtwein, Lisa B Sheeber, Nicholas Allen, Jeffrey F Cohn, Louis-Philippe Morency","doi":"10.1145/3462244.3481004","DOIUrl":"https://doi.org/10.1145/3462244.3481004","url":null,"abstract":"<p><p>This paper studies the hypothesis that not all modalities are always needed to predict affective states. We explore this hypothesis in the context of recognizing three affective states that have shown a relation to a future onset of depression: positive, aggressive, and dysphoric. In particular, we investigate three important modalities for face-to-face conversations: vision, language, and acoustic modality. We first perform a human study to better understand which subset of modalities people find informative, when recognizing three affective states. As a second contribution, we explore how these human annotations can guide automatic affect recognition systems to be more interpretable while not degrading their predictive performance. Our studies show that humans can reliably annotate modality informativeness. Further, we observe that guided models significantly improve interpretability, i.e., they attend to modalities similarly to how humans rate the modality informativeness, while at the same time showing a slight increase in predictive performance.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":" ","pages":"728-734"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8812829/pdf/nihms-1770971.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39895427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zakia Hammal, Di Huang, Kévin Bailly, Liming Chen, Mohamed Daoudi
The goal of the Face and Gesture Analysis for Health Informatics workshop is to share and discuss achievements as well as challenges in using computer vision and machine learning for automatic human behavior analysis and modeling in clinical research and healthcare applications. The workshop aims to promote current research and support the growth of multidisciplinary collaborations to advance this groundbreaking work. The meeting gathers scientists working in related areas of computer vision and machine learning, multimodal signal processing and fusion, human-centered computing, behavioral sensing, assistive technologies, and medical tutoring systems for healthcare applications and medicine.
{"title":"Face and Gesture Analysis for Health Informatics.","authors":"Zakia Hammal, Di Huang, Kévin Bailly, Liming Chen, Mohamed Daoudi","doi":"10.1145/3382507.3419747","DOIUrl":"10.1145/3382507.3419747","url":null,"abstract":"<p><p>The goal of Face and Gesture Analysis for Health Informatics's workshop is to share and discuss the achievements as well as the challenges in using computer vision and machine learning for automatic human behavior analysis and modeling for clinical research and healthcare applications. The workshop aims to promote current research and support growth of multidisciplinary collaborations to advance this groundbreaking research. The meeting gathers scientists working in related areas of computer vision and machine learning, multi-modal signal processing and fusion, human centered computing, behavioral sensing, assistive technologies, and medical tutoring systems for healthcare applications and medicine.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2020 ","pages":"874-875"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710162/pdf/nihms-1643015.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38676363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Victoria Lin, Jeffrey M Girard, Michael A Sayette, Louis-Philippe Morency
Emotional expressiveness captures the extent to which a person tends to outwardly display their emotions through behavior. Due to the close relationship between emotional expressiveness and behavioral health, as well as the crucial role that it plays in social interaction, the ability to automatically predict emotional expressiveness stands to spur advances in science, medicine, and industry. In this paper, we explore three related research questions. First, how well can emotional expressiveness be predicted from visual, linguistic, and multimodal behavioral signals? Second, how important is each behavioral modality to the prediction of emotional expressiveness? Third, which behavioral signals are reliably related to emotional expressiveness? To answer these questions, we add highly reliable transcripts and human ratings of perceived emotional expressiveness to an existing video database and use this data to train, validate, and test predictive models. Our best model shows promising predictive performance on this dataset (RMSE = 0.65, R² = 0.45, r = 0.74). Multimodal models tend to perform best overall, and models trained on the linguistic modality tend to outperform models trained on the visual modality. Finally, examination of our interpretable models' coefficients reveals a number of visual and linguistic behavioral signals, such as facial action unit intensity, overall word count, and use of words related to social processes, that reliably predict emotional expressiveness.
{"title":"Toward Multimodal Modeling of Emotional Expressiveness.","authors":"Victoria Lin, Jeffrey M Girard, Michael A Sayette, Louis-Philippe Morency","doi":"10.1145/3382507.3418887","DOIUrl":"10.1145/3382507.3418887","url":null,"abstract":"<p><p>Emotional expressiveness captures the extent to which a person tends to outwardly display their emotions through behavior. Due to the close relationship between emotional expressiveness and behavioral health, as well as the crucial role that it plays in social interaction, the ability to automatically predict emotional expressiveness stands to spur advances in science, medicine, and industry. In this paper, we explore three related research questions. First, how well can emotional expressiveness be predicted from visual, linguistic, and multimodal behavioral signals? Second, how important is each behavioral modality to the prediction of emotional expressiveness? Third, which behavioral signals are reliably related to emotional expressiveness? To answer these questions, we add highly reliable transcripts and human ratings of perceived emotional expressiveness to an existing video database and use this data to train, validate, and test predictive models. Our best model shows promising predictive performance on this dataset (<i>RMSE</i> = 0.65, <i>R</i> <sup>2</sup> = 0.45, <i>r</i> = 0.74). Multimodal models tend to perform best overall, and models trained on the linguistic modality tend to outperform models trained on the visual modality. Finally, examination of our interpretable models' coefficients reveals a number of visual and linguistic behavioral signals-such as facial action unit intensity, overall word count, and use of words related to social processes-that reliably predict emotional expressiveness.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2020 ","pages":"548-557"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8106384/pdf/nihms-1680572.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38966276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michal Muszynski, Jamie Zelazny, Jeffrey M Girard, Louis-Philippe Morency
Recent progress in artificial intelligence has led to the development of automatic behavioral marker recognition, such as facial and vocal expressions. These automatic tools have enormous potential to support mental health assessment, clinical decision making, and treatment planning. In this paper, we investigate nonverbal behavioral markers of depression severity assessed during semi-structured medical interviews of adolescent patients. The main goal of our research is two-fold: studying a unique population of adolescents at high risk of mental disorders and differentiating mild depression from moderate or severe depression. We aim to explore computationally inferred facial and vocal behavioral responses elicited by three segments of the semi-structured medical interviews: Distress Assessment Questions, Ubiquitous Questions, and Concept Questions. Our experimental methodology reflects best practices for analyzing small, unbalanced datasets of unique patients. Our results reveal strongly discriminative behavioral markers from both the acoustic and visual modalities. These promising results are likely due to the unique classification task (mild depression vs. moderate and severe depression) and the three types of probing questions.
{"title":"Depression Severity Assessment for Adolescents at High Risk of Mental Disorders.","authors":"Michal Muszynski, Jamie Zelazny, Jeffrey M Girard, Louis-Philippe Morency","doi":"10.1145/3382507.3418859","DOIUrl":"10.1145/3382507.3418859","url":null,"abstract":"<p><p>Recent progress in artificial intelligence has led to the development of automatic behavioral marker recognition, such as facial and vocal expressions. Those automatic tools have enormous potential to support mental health assessment, clinical decision making, and treatment planning. In this paper, we investigate nonverbal behavioral markers of depression severity assessed during semi-structured medical interviews of adolescent patients. The main goal of our research is two-fold: studying a unique population of adolescents at high risk of mental disorders and differentiating mild depression from moderate or severe depression. We aim to explore computationally inferred facial and vocal behavioral responses elicited by three segments of the semi-structured medical interviews: Distress Assessment Questions, Ubiquitous Questions, and Concept Questions. Our experimental methodology reflects best practise used for analyzing small sample size and unbalanced datasets of unique patients. Our results show a very interesting trend with strongly discriminative behavioral markers from both acoustic and visual modalities. These promising results are likely due to the unique classification task (mild depression vs. moderate and severe depression) and three types of probing questions.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2020 ","pages":"70-78"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8005296/pdf/nihms-1680574.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25530531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diyala Erekat, Zakia Hammal, Maimoon Siddiqui, Hamdi Dibeklioğlu
The standard clinical assessment of pain is limited primarily to self-reported pain or clinician impression. While the self-reported measurement of pain is useful, in some circumstances it cannot be obtained. Automatic facial expression analysis has emerged as a potential solution for an objective, reliable, and valid measurement of pain. In this study, we propose a video-based approach for the automatic measurement of self-reported pain and observer-rated pain intensity. To this end, we explore the added value of three self-reported pain scales, i.e., the Visual Analog Scale (VAS), the Sensory Scale (SEN), and the Affective Motivational Scale (AFF), as well as the Observer Pain Intensity (OPI) rating, for a reliable assessment of pain intensity from facial expression. Using a spatio-temporal Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture, we propose to jointly minimize the mean absolute error of pain score estimation for each of these scales while maximizing the consistency between them. The reliability of the proposed method is evaluated on the benchmark database for pain measurement from videos, namely the UNBC-McMaster Pain Archive. Our results show that enforcing consistency between the different self-reported pain intensity scores collected using different pain scales enhances the quality of predictions and improves the state of the art in automatic self-reported pain estimation. The obtained results suggest that automatic assessment of self-reported pain intensity from videos is feasible and could be used as a complementary instrument to unburden caregivers, especially for vulnerable populations that need constant monitoring.
{"title":"Enforcing Multilabel Consistency for Automatic Spatio-Temporal Assessment of Shoulder Pain Intensity.","authors":"Diyala Erekat, Zakia Hammal, Maimoon Siddiqui, Hamdi Dibeklioğlu","doi":"10.1145/3395035.3425190","DOIUrl":"10.1145/3395035.3425190","url":null,"abstract":"<p><p>The standard clinical assessment of pain is limited primarily to self-reported pain or clinician impression. While the self-reported measurement of pain is useful, in some circumstances it cannot be obtained. Automatic facial expression analysis has emerged as a potential solution for an objective, reliable, and valid measurement of pain. In this study, we propose a video based approach for the automatic measurement of self-reported pain and the observer pain intensity, respectively. To this end, we explore the added value of three self-reported pain scales, i.e., the Visual Analog Scale (VAS), the Sensory Scale (SEN), and the Affective Motivational Scale (AFF), as well as the Observer Pain Intensity (OPI) rating for a reliable assessment of pain intensity from facial expression. Using a spatio-temporal Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture, we propose to jointly minimize the mean absolute error of pain scores estimation for each of these scales while maximizing the consistency between them. The reliability of the proposed method is evaluated on the benchmark database for pain measurement from videos, namely, the UNBC-McMaster Pain Archive. Our results show that enforcing the consistency between different self-reported pain intensity scores collected using different pain scales enhances the quality of predictions and improve the state of the art in automatic self-reported pain estimation. The obtained results suggest that automatic assessment of self-reported pain intensity from videos is feasible, and could be used as a complementary instrument to unburden caregivers, specially for vulnerable populations that need constant monitoring.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2020 ","pages":"156-164"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3395035.3425190","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39858931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leili Tavabi, Brian Borsari, Kalin Stefanov, Joshua D Woolley, Mohammad Soleymani, Larry Zhang, Stefan Scherer
Motivational Interviewing (MI) is defined as a collaborative conversation style that evokes the client's own intrinsic reasons for behavioral change. In MI research, the client's attitude (willingness or resistance) toward change, as expressed through language, has been identified as an important indicator of their subsequent behavior change. Automated coding of these indicators provides a systematic and efficient means for the analysis and assessment of MI therapy sessions. In this paper, we study and analyze behavioral cues in client language and speech that indicate the client's disposition toward change during a therapy session, using a database of dyadic motivational interviews between therapists and clients with alcohol-related problems. Deep language and voice encoders, i.e., BERT and VGGish, trained on large amounts of data, are used to extract features from each utterance. We develop a neural network to automatically detect the MI codes using both the clients' and therapists' language and the clients' voice, and demonstrate the importance of semantic context in such detection. Additionally, we develop machine learning models for predicting clients' alcohol-use behavioral outcomes through language and voice analysis. Our analysis demonstrates that we are able to estimate MI codes using clients' textual utterances along with preceding textual context from both the therapist and client, reaching an F1-score of 0.72 for speaker-independent three-class classification. We also report initial results for using the clients' data to predict behavioral outcomes, which outlines a direction for future work.
{"title":"Multimodal Automatic Coding of Client Behavior in Motivational Interviewing.","authors":"Leili Tavabi, Brian Borsari, Kalin Stefanov, Joshua D Woolley, Mohammad Soleymani, Larry Zhang, Stefan Scherer","doi":"10.1145/3382507.3418853","DOIUrl":"10.1145/3382507.3418853","url":null,"abstract":"<p><p>Motivational Interviewing (MI) is defined as a collaborative conversation style that evokes the client's own intrinsic reasons for behavioral change. In MI research, the clients' attitude (willingness or resistance) toward change as expressed through language, has been identified as an important indicator of their subsequent behavior change. Automated coding of these indicators provides systematic and efficient means for the analysis and assessment of MI therapy sessions. In this paper, we study and analyze behavioral cues in client language and speech that bear indications of the client's behavior toward change during a therapy session, using a database of dyadic motivational interviews between therapists and clients with alcohol-related problems. Deep language and voice encoders, <i>i.e.,</i> BERT and VGGish, trained on large amounts of data are used to extract features from each utterance. We develop a neural network to automatically detect the MI codes using both the clients' and therapists' language and clients' voice, and demonstrate the importance of semantic context in such detection. Additionally, we develop machine learning models for predicting alcohol-use behavioral outcomes of clients through language and voice analysis. Our analysis demonstrates that we are able to estimate MI codes using clients' textual utterances along with preceding textual context from both the therapist and client, reaching an F1-score of 0.72 for a speaker-independent three-class classification. We also report initial results for using the clients' data for predicting behavioral outcomes, which outlines the direction for future work.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2020 ","pages":"406-413"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321780/pdf/nihms-1727152.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39266881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeffrey F Cohn, Michael S Okun, Laszlo A Jeni, Itir Onal Ertugrul, David Borton, Donald Malone, Wayne K Goodman
Automated measurement of affective behavior in psychopathology has been limited primarily to screening and diagnosis. While useful, clinicians more often are concerned with whether patients are improving in response to treatment. Are symptoms abating, is affect becoming more positive, are unanticipated side effects emerging? When treatment includes neural implants, the need for objective, repeatable biometrics tied to neurophysiology becomes especially pressing. We used automated face analysis to assess treatment response to deep brain stimulation (DBS) in two patients with intractable obsessive-compulsive disorder (OCD). One was assessed intraoperatively following implantation and activation of the DBS device. The other was assessed three months post-implantation. Both were assessed during DBS-on and DBS-off conditions. Positive and negative valence were quantified using a CNN trained on normative data from 160 non-OCD participants. Thus, a secondary goal was domain transfer of the classifiers. In both contexts, DBS-on resulted in marked positive affect. In response to DBS-off, affect flattened in both contexts and alternated with increased negative affect in the outpatient setting. Mean AUC for domain transfer was 0.87. These findings suggest that parametric variation of DBS is strongly related to affective behavior and may introduce vulnerability to negative affect in the event that DBS is discontinued.
{"title":"Automated Affect Detection in Deep Brain Stimulation for Obsessive-Compulsive Disorder: A Pilot Study.","authors":"Jeffrey F Cohn, Michael S Okun, Laszlo A Jeni, Itir Onal Ertugrul, David Borton, Donald Malone, Wayne K Goodman","doi":"10.1145/3242969.3243023","DOIUrl":"10.1145/3242969.3243023","url":null,"abstract":"<p><p>Automated measurement of affective behavior in psychopathology has been limited primarily to screening and diagnosis. While useful, clinicians more often are concerned with whether patients are improving in response to treatment. Are symptoms abating, is affect becoming more positive, are unanticipated side effects emerging? When treatment includes neural implants, need for objective, repeatable biometrics tied to neurophysiology becomes especially pressing. We used automated face analysis to assess treatment response to deep brain stimulation (DBS) in two patients with intractable obsessive-compulsive disorder (OCD). One was assessed intraoperatively following implantation and activation of the DBS device. The other was assessed three months post-implantation. Both were assessed during DBS on and o conditions. Positive and negative valence were quantified using a CNN trained on normative data of 160 non-OCD participants. Thus, a secondary goal was domain transfer of the classifiers. In both contexts, DBS-on resulted in marked positive affect. In response to DBS-off, affect flattened in both contexts and alternated with increased negative affect in the outpatient setting. Mean AUC for domain transfer was 0.87. These findings suggest that parametric variation of DBS is strongly related to affective behavior and may introduce vulnerability for negative affect in the event that DBS is discontinued.</p>","PeriodicalId":74508,"journal":{"name":"Proceedings of the ... ACM International Conference on Multimodal Interaction. ICMI (Conference)","volume":"2018 ","pages":"40-44"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3242969.3243023","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36748553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}