Dustin Pulver, Prithila Angkan, Paul Hungler, Ali Etemad
Cognitive load, the amount of mental effort required for task completion, plays an important role in performance and decision-making outcomes, making its classification and analysis essential in various sensitive domains. In this paper, we present a new solution for the classification of cognitive load using electroencephalogram (EEG). Our model uses a transformer architecture employing transfer learning between emotions and cognitive load. We pre-train our model using self-supervised masked autoencoding on emotion-related EEG datasets and use transfer learning with both frozen weights and fine-tuning to perform downstream cognitive load classification. To evaluate our method, we carry out a series of experiments utilizing two publicly available EEG-based emotion datasets, namely SEED and SEED-IV, for pre-training, while we use the CL-Drive dataset for downstream cognitive load classification. The results of our experiments show that our proposed approach achieves strong results and outperforms conventional single-stage fully supervised learning. Moreover, we perform detailed ablation and sensitivity studies to evaluate the impact of different aspects of our proposed solution. This research contributes to the growing body of literature in affective computing with a focus on cognitive load, and opens up new avenues for future research in the field of cross-domain transfer learning using self-supervised pre-training.
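As a rough illustration of the two-stage recipe described above (self-supervised masked autoencoding for pre-training, then frozen or fine-tuned transfer to cognitive load classification), the following PyTorch sketch shows the general shape of such a pipeline. The masking ratio, tensor shapes, and layer sizes are assumptions for illustration, not the authors' configuration.

# Minimal sketch: masked autoencoding on EEG feature tokens, then transfer.
# Shapes, masking ratio, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class EEGMaskedAutoencoder(nn.Module):
    def __init__(self, n_features=310, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.Linear(d_model, n_features)      # reconstruct masked tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x, mask_ratio=0.5):
        # x: (batch, segments, features); randomly mask a subset of segments
        tokens = self.embed(x)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decoder(self.encoder(tokens))
        return ((recon - x) ** 2)[mask].mean()              # loss on masked positions only

# Downstream: reuse the pre-trained encoder with frozen weights or fine-tuning.
def build_classifier(pretrained: EEGMaskedAutoencoder, n_classes=2, freeze=True):
    if freeze:
        for p in list(pretrained.embed.parameters()) + list(pretrained.encoder.parameters()):
            p.requires_grad = False
    head = nn.Linear(128, n_classes)
    def classify(x):
        h = pretrained.encoder(pretrained.embed(x)).mean(dim=1)  # pool over segments
        return head(h)
    return classify, head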
{"title":"EEG-based Cognitive Load Classification using Feature Masked Autoencoding and Emotion Transfer Learning","authors":"Dustin Pulver, Prithila Angkan, Paul Hungler, Ali Etemad","doi":"10.1145/3577190.3614113","DOIUrl":"https://doi.org/10.1145/3577190.3614113","url":null,"abstract":"Cognitive load, the amount of mental effort required for task completion, plays an important role in performance and decision-making outcomes, making its classification and analysis essential in various sensitive domains. In this paper, we present a new solution for the classification of cognitive load using electroencephalogram (EEG). Our model uses a transformer architecture employing transfer learning between emotions and cognitive load. We pre-train our model using self-supervised masked autoencoding on emotion-related EEG datasets and use transfer learning with both frozen weights and fine-tuning to perform downstream cognitive load classification. To evaluate our method, we carry out a series of experiments utilizing two publicly available EEG-based emotion datasets, namely SEED and SEED-IV, for pre-training, while we use the CL-Drive dataset for downstream cognitive load classification. The results of our experiments show that our proposed approach achieves strong results and outperforms conventional single-stage fully supervised learning. Moreover, we perform detailed ablation and sensitivity studies to evaluate the impact of different aspects of our proposed solution. This research contributes to the growing body of literature in affective computing with a focus on cognitive load, and opens up new avenues for future research in the field of cross-domain transfer learning using self-supervised pre-training.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandria K. Vail, Jeffrey M. Girard, Lauren M. Bylsma, Jay Fournier, Holly A. Swartz, Jeffrey F. Cohn, Louis-Philippe Morency
Characterizing the dynamics of behavior across multiple modalities and individuals is a vital component of computational behavior analysis. This is especially important in certain applications, such as psychotherapy, where individualized tracking of behavior patterns can provide valuable information about the patient’s mental state. Conventional methods that rely on aggregate statistics and correlational metrics may not always suffice, as they are often unable to capture causal relationships or evaluate the true probability of identified patterns. To address these challenges, we present a novel approach to learning multimodal and interpersonal representations of behavior dynamics during one-on-one interaction. Our approach is enabled by the introduction of a multiview extension of latent change score models, which facilitates the concurrent capture of both inter-modal and interpersonal behavior dynamics and the identification of directional relationships between them. A core advantage of our approach is its high level of interpretability while simultaneously achieving strong predictive performance. We evaluate our approach within the domain of therapist-client interactions, with the objective of gaining a deeper understanding about the collaborative relationship between the two, a crucial element of the therapeutic process. Our results demonstrate improved performance over conventional approaches that rely upon summary statistics or correlational metrics. Furthermore, since our multiview approach includes the explicit modeling of uncertainty, it naturally lends itself to integration with probabilistic classifiers, such as Gaussian process models. We demonstrate that this integration leads to even further improved performance, all the while maintaining highly interpretable qualities. Our analysis provides compelling motivation for further exploration of stochastic systems within computational models of behavior.
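For context, the standard univariate latent change score model that multiview extensions build on can be written as follows; this is the textbook formulation with generic symbols, not the paper's multiview equations.

% Univariate latent change score model (textbook form, not the multiview extension).
% y_t: observed score, \eta_t: latent true score, \Delta\eta_t: latent change.
\begin{align}
  y_t &= \eta_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2_\epsilon) \\
  \eta_t &= \eta_{t-1} + \Delta\eta_t \\
  \Delta\eta_t &= \alpha\, s + \beta\, \eta_{t-1}
\end{align}

Here s is a constant latent slope, alpha scales the additive component of change, and beta captures proportional change driven by the previous latent state.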
{"title":"Representation Learning for Interpersonal and Multimodal Behavior Dynamics: A Multiview Extension of Latent Change Score Models","authors":"Alexandria K. Vail, Jeffrey M. Girard, Lauren M. Bylsma, Jay Fournier, Holly A. Swartz, Jeffrey F. Cohn, Louis-Philippe Morency","doi":"10.1145/3577190.3614118","DOIUrl":"https://doi.org/10.1145/3577190.3614118","url":null,"abstract":"Characterizing the dynamics of behavior across multiple modalities and individuals is a vital component of computational behavior analysis. This is especially important in certain applications, such as psychotherapy, where individualized tracking of behavior patterns can provide valuable information about the patient’s mental state. Conventional methods that rely on aggregate statistics and correlational metrics may not always suffice, as they are often unable to capture causal relationships or evaluate the true probability of identified patterns. To address these challenges, we present a novel approach to learning multimodal and interpersonal representations of behavior dynamics during one-on-one interaction. Our approach is enabled by the introduction of a multiview extension of latent change score models, which facilitates the concurrent capture of both inter-modal and interpersonal behavior dynamics and the identification of directional relationships between them. A core advantage of our approach is its high level of interpretability while simultaneously achieving strong predictive performance. We evaluate our approach within the domain of therapist-client interactions, with the objective of gaining a deeper understanding about the collaborative relationship between the two, a crucial element of the therapeutic process. Our results demonstrate improved performance over conventional approaches that rely upon summary statistics or correlational metrics. Furthermore, since our multiview approach includes the explicit modeling of uncertainty, it naturally lends itself to integration with probabilistic classifiers, such as Gaussian process models. We demonstrate that this integration leads to even further improved performance, all the while maintaining highly interpretable qualities. Our analysis provides compelling motivation for further exploration of stochastic systems within computational models of behavior.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A prediction-explanation framework is proposed to identify when and what behaviors are involved in forming interlocutors’ impressions in group discussions. We targeted the self-reported scores of 16 impressions, including enjoyment and concentration. To that end, we formulate the problem as discovering the behavioral features that contributed to the impression prediction and determining the timings at which those behaviors frequently occurred. To solve this problem, this paper proposes a two-fold framework consisting of a prediction part followed by an explanation part. The prediction part employs random forest regressors using functional head-movement features and BERT-based linguistic features, which can capture various aspects of interactive conversational behaviors. The explanation part measures each feature’s contribution to the prediction using a SHAP analysis and introduces a novel idea of temporal decomposition of features’ contributions over time. The influential behaviors and their timings are identified from local maxima of the temporal distribution of features’ contributions. Targeting 17 four-female group discussions, the predictability and explainability of the proposed framework are confirmed through case studies and quantitative evaluations of the detected timings.
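A minimal sketch of a prediction-plus-explanation pipeline of this kind is given below: a random forest regressor over aggregated behavioral features, SHAP values from a tree explainer, and a toy temporal decomposition that spreads each feature's contribution over the time bins in which that behavior occurred. The occurrence-count weighting and array shapes are illustrative assumptions, not the authors' exact procedure.

# Sketch: impression prediction + SHAP-based temporal decomposition.
# The occurrence-count weighting below is an illustrative assumption.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

def fit_and_explain(X, y):
    # X: (n_sessions, n_features) aggregated head-movement / linguistic features
    # y: (n_sessions,) self-reported impression scores
    model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)   # (n_sessions, n_features)
    return model, shap_values

def temporal_decomposition(shap_values, occurrence_counts):
    # occurrence_counts: (n_sessions, n_features, n_time_bins), how often each
    # behavior occurred in each time bin; spread each feature's contribution
    # proportionally over the bins where that behavior was observed.
    weights = occurrence_counts / np.clip(
        occurrence_counts.sum(axis=2, keepdims=True), 1e-9, None)
    contrib_over_time = shap_values[..., None] * weights     # (sessions, features, bins)
    # Influential timings correspond to local maxima of the summed contribution curve.
    return contrib_over_time.sum(axis=1)                     # (sessions, n_time_bins)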
{"title":"Identifying Interlocutors' Behaviors and its Timings Involved with Impression Formation from Head-Movement Features and Linguistic Features","authors":"Shumpei Otsuchi, Koya Ito, Yoko Ishii, Ryo Ishii, Shinichirou Eitoku, Kazuhiro Otsuka","doi":"10.1145/3577190.3614124","DOIUrl":"https://doi.org/10.1145/3577190.3614124","url":null,"abstract":"A prediction-explanation framework is proposed to identify when and what behaviors are involved in forming interlocutors’ impressions in group discussions. We targeted the self-reported scores of 16 impressions, including enjoyment and concentration. To that end, we formulate the problem as discovering behavioral features that contributed to the impression prediction and determining the timings that the behaviors frequently occurred. To solve this problem, this paper proposes a two-fold framework consisting of the prediction part followed by the explanation part. The former prediction part employs random forest regressors using functional head-movement features and BERT-based linguistic features, which can capture various aspects of interactive conversational behaviors. The later part measures the levels of features’ contribution to the prediction using a SHAP analysis and introduces a novel idea of temporal decomposition of features’ contributions over time. The influential behaviors and their timings are identified from local maximums of the temporal distribution of features’ contributions. Targeting 17-group 4-female discussions, the predictability and explainability of the proposed framework are confirmed by some case studies and quantitative evaluations of the detected timings.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel approach to mitigate bias in facial expression recognition (FER) models. Our method aims to reduce sensitive attribute information such as gender, age, or race, in the embeddings produced by FER models. We employ a kernel mean shrinkage estimator to estimate the kernel mean of the distributions of the embeddings associated with different sensitive attribute groups, such as young and old, in the Hilbert space. Using this estimation, we calculate the maximum mean discrepancy (MMD) distance between the distributions and incorporate it in the classifier loss along with an adversarial loss, which is then minimized through the learning process to improve the distribution alignment. Our method makes sensitive attributes less recognizable for the model, which in turn promotes fairness. Additionally, for the first time, we analyze the notion of attractiveness as an important sensitive attribute in FER models and demonstrate that FER models can indeed exhibit biases towards more attractive faces. To prove the efficacy of our model in reducing bias regarding different sensitive attributes (including the newly proposed attractiveness attribute), we perform several experiments on two widely used datasets, CelebA and RAF-DB. The results in terms of both accuracy and fairness measures outperform the state-of-the-art in most cases, demonstrating the effectiveness of the proposed method.
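The distribution-alignment term can be illustrated with a plain RBF-kernel MMD between embeddings of two sensitive-attribute groups, added to the classification loss; the sketch below uses ordinary empirical kernel means and omits the shrinkage estimator and adversarial head described in the paper.

# Sketch: RBF-kernel MMD^2 between embeddings of two sensitive-attribute groups,
# added to the classification loss. Bandwidth and weight are illustrative.
import torch

def rbf_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(emb_group_a, emb_group_b, sigma=1.0):
    # Biased empirical MMD^2 with plain kernel means (no shrinkage estimator here).
    k_aa = rbf_kernel(emb_group_a, emb_group_a, sigma).mean()
    k_bb = rbf_kernel(emb_group_b, emb_group_b, sigma).mean()
    k_ab = rbf_kernel(emb_group_a, emb_group_b, sigma).mean()
    return k_aa + k_bb - 2 * k_ab

def total_loss(logits, labels, emb, group_mask, mmd_weight=1.0):
    # group_mask: boolean tensor, True for one sensitive-attribute group.
    ce = torch.nn.functional.cross_entropy(logits, labels)
    alignment = mmd2(emb[group_mask], emb[~group_mask])
    return ce + mmd_weight * alignment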
{"title":"Toward Fair Facial Expression Recognition with Improved Distribution Alignment","authors":"Mojtaba Kolahdouzi, Ali Etemad","doi":"10.1145/3577190.3614141","DOIUrl":"https://doi.org/10.1145/3577190.3614141","url":null,"abstract":"We present a novel approach to mitigate bias in facial expression recognition (FER) models. Our method aims to reduce sensitive attribute information such as gender, age, or race, in the embeddings produced by FER models. We employ a kernel mean shrinkage estimator to estimate the kernel mean of the distributions of the embeddings associated with different sensitive attribute groups, such as young and old, in the Hilbert space. Using this estimation, we calculate the maximum mean discrepancy (MMD) distance between the distributions and incorporate it in the classifier loss along with an adversarial loss, which is then minimized through the learning process to improve the distribution alignment. Our method makes sensitive attributes less recognizable for the model, which in turn promotes fairness. Additionally, for the first time, we analyze the notion of attractiveness as an important sensitive attribute in FER models and demonstrate that FER models can indeed exhibit biases towards more attractive faces. To prove the efficacy of our model in reducing bias regarding different sensitive attributes (including the newly proposed attractiveness attribute), we perform several experiments on two widely used datasets, CelebA and RAF-DB. The results in terms of both accuracy and fairness measures outperform the state-of-the-art in most cases, demonstrating the effectiveness of the proposed method.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"274 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernardo Marques, Samuel Silva, Rafael Maio, João Alves, Carlos Ferreira, Paulo Dias, Beatriz Sousa Santos
Over time, numerous multimodal eXtended Reality (XR) user studies have been conducted in laboratory environments, with participants fulfilling tasks under the guidance of a researcher. Although generalizable results have contributed to increasing the maturity of the field, it is also paramount to address the ecological validity of evaluations outside the laboratory. While real-world scenarios are clearly challenging, successful in-situ and remote deployment has become a realistic way to address a broad variety of research questions, expanding the participant sample to more specific target users and accounting for multimodal constraints not reflected in controlled laboratory settings, among other benefits. In this paper, a set of multimodal XR experiments conducted outside the laboratory is described (e.g., industrial field studies, remote collaborative tasks, longitudinal rehabilitation exercises). Then, a list of lessons learned is reported, illustrating challenges and opportunities, with the aim of raising awareness in the research community and facilitating further evaluations.
{"title":"Evaluating Outside the Box: Lessons Learned on eXtended Reality Multi-modal Experiments Beyond the Laboratory","authors":"Bernardo Marques, Samuel Silva, Rafael Maio, João Alves, Carlos Ferreira, Paulo Dias, Beatriz Sousa Santos","doi":"10.1145/3577190.3614134","DOIUrl":"https://doi.org/10.1145/3577190.3614134","url":null,"abstract":"Over time, numerous multimodal eXtended Reality (XR) user studies have been conducted in laboratory environments, with participants fulfilling tasks under the guidance of a researcher. Although generalizable results contributed to increase the maturity of the field, it is also paramount to address the ecological validity of evaluations outside the laboratory. Despite real-world scenarios being clearly challenging, successful in-situ and remote deployment has become realistic to address a broad variety of research questions, thus, expanding participants’ sample to more specific target users, considering multi-modal constraints not reflected in controlled laboratory settings and other benefits. In this paper, a set of multimodal XR experiments conducted outside the laboratory are described (e.g., industrial field studies, remote collaborative tasks, longitudinal rehabilitation exercises). Then, a list of lessons learned is reported, illustrating challenges, and opportunities, aiming to increase the level of awareness of the research community and facilitate performing further evaluations.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a computational study to analyze and predict turns (i.e., turn-taking and turn-keeping) in multiparty conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of multi-modal natural conversational behaviors of interlocutors in three-party conversations, including gazes, head movements, body movements, speech, etc. Based on the inter-pausal units (IPUs) extracted from the in-house acquired dataset, we propose a transformer-based computational model to predict turns based on the interlocutor states (speaking/back-channeling/silence) and the gaze targets. Our model robustly achieves more than 80% accuracy, and its generalizability was extensively validated through cross-group experiments. We also introduce a novel computational metric called the “relative engagement level” (REL) of IPUs, and validate its statistical significance between turn-keeping and turn-taking IPUs, and between different conversational groups. Our experimental results also show that the patterns of the interlocutor states are a more effective cue than gaze behaviors for predicting turns in multiparty conversations.
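As a rough illustration of an IPU-level turn classifier of this kind, the sketch below embeds categorical interlocutor-state and gaze-target codes, passes the sequence through a small transformer encoder, and predicts turn-keeping versus turn-taking; the vocabulary sizes, context-window layout, and dimensions are assumptions rather than the authors' configuration.

# Sketch: transformer over per-IPU categorical cues (speaker state, gaze target)
# predicting turn-taking vs. turn-keeping. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TurnPredictor(nn.Module):
    def __init__(self, n_states=3, n_gaze_targets=4, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.state_emb = nn.Embedding(n_states, d_model)
        self.gaze_emb = nn.Embedding(n_gaze_targets, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)     # turn-keeping vs. turn-taking

    def forward(self, states, gazes):
        # states, gazes: (batch, seq_len) integer codes over an IPU context window
        x = self.state_emb(states) + self.gaze_emb(gazes)
        h = self.encoder(x).mean(dim=1)       # pool over the context window
        return self.head(h)

# Example usage with dummy codes for a batch of two context windows.
model = TurnPredictor()
logits = model(torch.randint(0, 3, (2, 8)), torch.randint(0, 4, (2, 8)))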
{"title":"Multimodal Turn Analysis and Prediction for Multi-party Conversations","authors":"Meng-Chen Lee, Mai Trinh, Zhigang Deng","doi":"10.1145/3577190.3614139","DOIUrl":"https://doi.org/10.1145/3577190.3614139","url":null,"abstract":"This paper presents a computational study to analyze and predict turns (i.e., turn-taking and turn-keeping) in multiparty conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of multi-modal natural conversational behaviors of interlocutors in three-party conversations, including gazes, head movements, body movements, speech, etc. Based on the inter-pausal units (IPUs) extracted from the in-house acquired dataset, we propose a transformer-based computational model to predict the turns based on the interlocutor states (speaking/back-channeling/silence) and the gaze targets. Our model can robustly achieve more than 80% accuracy, and the generalizability of our model was extensively validated through cross-group experiments. Also, we introduce a novel computational metric called “relative engagement level\" (REL) of IPUs, and further validate its statistical significance between turn-keeping IPUs and turn-taking IPUs, and between different conversational groups. Our experimental results also found that the patterns of the interlocutor states can be used as a more effective cue than their gaze behaviors for predicting turns in multiparty conversations.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large multimodal deep learning models such as Contrastive Language Image Pretraining (CLIP) have become increasingly powerful with applications across several domains in recent years. CLIP works on visual and language modalities and forms a part of several popular models, such as DALL-E and Stable Diffusion. It is trained on a large dataset of millions of image-text pairs crawled from the internet. Such large datasets are often used for training purposes without filtering, leading to models inheriting social biases from internet data. Given that models such as CLIP are being applied in such a wide variety of applications ranging from social media to education, it is vital that harmful biases are detected. However, due to the unbounded nature of the possible inputs and outputs, traditional bias metrics such as accuracy cannot detect the range and complexity of biases present in the model. In this paper, we present an audit of CLIP using an established technique from natural language processing called Word Embeddings Association Test (WEAT) to detect and quantify gender bias in CLIP and demonstrate that it can provide a quantifiable measure of such stereotypical associations. We detected, measured, and visualised various types of stereotypical gender associations with respect to character descriptions and occupations and found that CLIP shows evidence of stereotypical gender bias.
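The WEAT statistic used in such an audit can be computed directly on embedding vectors; the sketch below assumes target and attribute embeddings (e.g., extracted from CLIP's encoders) are already available as numpy arrays.

# Sketch: WEAT effect size on pre-extracted embeddings (e.g., from CLIP encoders).
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity to attribute set A minus attribute set B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # X, Y: target concept embeddings (e.g., occupations); A, B: attribute
    # embeddings (e.g., gendered terms). Returns a Cohen's-d-style effect size.
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)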
{"title":"Multimodal Bias: Assessing Gender Bias in Computer Vision Models with NLP Techniques","authors":"Abhishek Mandal, Suzanne Little, Susan Leavy","doi":"10.1145/3577190.3614156","DOIUrl":"https://doi.org/10.1145/3577190.3614156","url":null,"abstract":"Large multimodal deep learning models such as Contrastive Language Image Pretraining (CLIP) have become increasingly powerful with applications across several domains in recent years. CLIP works on visual and language modalities and forms a part of several popular models, such as DALL-E and Stable Diffusion. It is trained on a large dataset of millions of image-text pairs crawled from the internet. Such large datasets are often used for training purposes without filtering, leading to models inheriting social biases from internet data. Given that models such as CLIP are being applied in such a wide variety of applications ranging from social media to education, it is vital that harmful biases are detected. However, due to the unbounded nature of the possible inputs and outputs, traditional bias metrics such as accuracy cannot detect the range and complexity of biases present in the model. In this paper, we present an audit of CLIP using an established technique from natural language processing called Word Embeddings Association Test (WEAT) to detect and quantify gender bias in CLIP and demonstrate that it can provide a quantifiable measure of such stereotypical associations. We detected, measured, and visualised various types of stereotypical gender associations with respect to character descriptions and occupations and found that CLIP shows evidence of stereotypical gender bias.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose an arbitrarily angled interactive audiovisual representation technique that combines a unique sound field synthesis with visual representation in order to augment the possibility of interactive immersive viewing experiences on mobile devices. This technique can synthesize two-channel stereo sound with a constant stereo width, covering an arbitrary angle range from 30 to 360 degrees centered on an arbitrary direction, from multi-channel surround sound. The visual representation can be either an equirectangular projection or a stereographic projection. The developed video player app allows users to enjoy arbitrarily angled 360-degree videos by manipulating the touchscreen, and the stereo sound and visual representation change with the view while remaining spatially synchronized. The app was released as a demonstration, and its acceptability and worth were investigated through interviews and subjective assessment tests. The app has been well received, and to date, more than 30 pieces of content have been produced in multiple genres, with a total of more than 200,000 views.
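As a simplified illustration only (not the authors' sound field synthesis), a constant-power downmix from multi-channel surround to two-channel stereo for a chosen view direction and stereo width could look like the sketch below; the channel azimuths and panning law are assumptions.

# Simplified illustration only: constant-power downmix of surround channels to
# stereo for a given view direction and stereo width. Not the paper's method.
import numpy as np

def downmix_to_stereo(channels, azimuths_deg, view_deg=0.0, width_deg=90.0):
    # channels: (n_channels, n_samples); azimuths_deg: channel directions in degrees.
    left = np.zeros(channels.shape[1])
    right = np.zeros(channels.shape[1])
    for sig, az in zip(channels, azimuths_deg):
        # Angle of this channel relative to the view direction, wrapped to [-180, 180).
        rel = (az - view_deg + 180.0) % 360.0 - 180.0
        # Map the relative angle into a pan position within the chosen stereo width.
        pan = np.clip(rel / (width_deg / 2.0), -1.0, 1.0)   # -1 = hard left, +1 = hard right
        theta = (pan + 1.0) * np.pi / 4.0                    # constant-power panning law
        left += np.cos(theta) * sig
        right += np.sin(theta) * sig
    return np.stack([left, right])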
{"title":"Augmented Immersive Viewing and Listening Experience Based on Arbitrarily Angled Interactive Audiovisual Representation","authors":"Toshiharu Horiuchi, Shota Okubo, Tatsuya Kobayashi","doi":"10.1145/3577190.3614138","DOIUrl":"https://doi.org/10.1145/3577190.3614138","url":null,"abstract":"We propose an arbitrarily angled interactive audiovisual representation technique that combines a unique sound field synthesis with visual representation in order to augment the possibility of interactive immersive viewing experiences on mobile devices. This technique can synthesize two-channel stereo sound with constant stereo width having an arbitrary angle range from minimum 30 to maximum 360 degrees centering on an arbitrary direction from multi-channel surround sound. The visual representation can be chosen either equirectangular projection or stereographic projection. The developed video player app allows users to enjoy arbitrarily angled 360-degree videos by manipulating the touchscreen, and the stereo sound and the visual representation changes in terms of its spatial synchronization depending on the view. The app was released as a demonstration, and its acceptability and worth were investigated through interviews and subjective assessment tests. The app has been well received, and to date, more than 30 pieces of content have been produced in multiple genres, with a total of more than 200,000 views.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
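The quantization step at the core of such a pipeline, mapping encoder outputs to their nearest codebook vectors with a straight-through gradient, can be sketched as follows; the codebook size and dimensionality are illustrative, and the GAN and GRU-Transformer components are omitted.

# Sketch: nearest-neighbour codebook quantization with a straight-through
# estimator, as used in VQ-style gesture pipelines. Sizes are illustrative.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, code_dim)
        self.beta = beta

    def forward(self, z):
        # z: (batch, seq_len, code_dim) continuous encoder outputs
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
        idx = dist.argmin(dim=1)                           # nearest codebook entry
        z_q = self.codebook(idx).view_as(z)
        # Commitment + codebook losses; straight-through gradient for the decoder.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss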
{"title":"AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis","authors":"Hendric Voß, Stefan Kopp","doi":"10.1145/3577190.3614135","DOIUrl":"https://doi.org/10.1145/3577190.3614135","url":null,"abstract":"The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-video group emotion recognition is a challenging task that has attracted increasing attention in recent decades. Recently, deep learning models have shown tremendous advances in analyzing human emotion. However, the task remains difficult: it is hard to gather a broad enough range of potential information to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge the way humans do. To tackle these problems, in this paper we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for group emotion recognition. The framework consists of three parallel feature extraction networks that were verified in previous work, followed by an attention network with an MLP backbone and specially designed loss functions that fuses features from different modalities. In the experiment section, we present its performance on the EmotiW 2023 Audio-Visual Group-based Emotion Recognition sub-challenge, which aims to classify a video into one of three emotions. According to the challenge feedback, our best result achieved 70.63 WAR and 70.38 UAR on the test set, demonstrating the effectiveness of our method.
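A rough sketch of MLP-based attention fusion over three modality embeddings, followed by a three-way classifier, is shown below; the feature dimensions and fusion form are assumptions rather than the LGFAM architecture.

# Sketch: MLP attention weights over three modality embeddings, fused by a
# weighted sum, then classified into three group emotions. Dims are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, feat_dim=256, n_classes=3):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, feats):
        # feats: (batch, n_modalities, feat_dim), one embedding per extractor
        scores = self.attn(feats).squeeze(-1)          # (batch, n_modalities)
        weights = torch.softmax(scores, dim=1)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)
        return self.classifier(fused)

model = AttentionFusion()
logits = model(torch.randn(4, 3, 256))   # e.g., video, audio, and face features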
{"title":"Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning","authors":"Sunan Li, Hailun Lian, Cheng Lu, Yan Zhao, Chuangao Tang, Yuan Zong, Wenming Zheng","doi":"10.1145/3577190.3616544","DOIUrl":"https://doi.org/10.1145/3577190.3616544","url":null,"abstract":"Audio-video group emotion recognition is a challenging task and has attracted more attention in recent decades. Recently, deep learning models have shown tremendous advances in analyzing human emotion. However, due to its difficulties such as hard to gather a broad range of potential information to obtain meaningful emotional representations and hard to associate implicit contextual knowledge like humans. To tackle these problems, in this paper, we proposed the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method to tackle the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. After that, an attention network using MLP as a backbone with specially designed loss functions was used to fuse features from different modalities. In the experiment section, we present its performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition subchallenge which aims to classify a video into one of the three emotions. According to the feedback results, the best result achieved 70.63 WAR and 70.38 UAR on the test set. Such improvement proves the effectiveness of our method.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135045192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}