Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation

{"title":"利用 LLM 进行多模态融合,预测自然对话中的参与度","authors":"Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K. Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L. Holt, Fernando De la Torre","doi":"arxiv-2409.09135","DOIUrl":null,"url":null,"abstract":"Over the past decade, wearable computing devices (``smart glasses'') have\nundergone remarkable advancements in sensor technology, design, and processing\npower, ushering in a new era of opportunity for high-density human behavior\ndata. Equipped with wearable cameras, these glasses offer a unique opportunity\nto analyze non-verbal behavior in natural settings as individuals interact. Our\nfocus lies in predicting engagement in dyadic interactions by scrutinizing\nverbal and non-verbal cues, aiming to detect signs of disinterest or confusion.\nLeveraging such analyses may revolutionize our understanding of human\ncommunication, foster more effective collaboration in professional\nenvironments, provide better mental health support through empathetic virtual\ninteractions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in\ncasual dyadic conversations, each providing self-reported engagement ratings at\nthe end of each conversation. We introduce a novel fusion strategy using Large\nLanguage Models (LLMs) to integrate multiple behavior modalities into a\n``multimodal transcript'' that can be processed by an LLM for behavioral\nreasoning tasks. Remarkably, this method achieves performance comparable to\nestablished fusion techniques even in its preliminary implementation,\nindicating strong potential for further research and optimization. This fusion\nmethod is one of the first to approach ``reasoning'' about real-world human\nbehavior through a language model. Smart glasses provide us the ability to\nunobtrusively gather high-density multimodal data on human behavior, paving the\nway for new approaches to understanding and improving human communication with\nthe potential for important societal benefits. The features and data collected\nduring the studies will be made publicly available to promote further research.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"49 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation\",\"authors\":\"Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K. Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L. Holt, Fernando De la Torre\",\"doi\":\"arxiv-2409.09135\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Over the past decade, wearable computing devices (``smart glasses'') have\\nundergone remarkable advancements in sensor technology, design, and processing\\npower, ushering in a new era of opportunity for high-density human behavior\\ndata. Equipped with wearable cameras, these glasses offer a unique opportunity\\nto analyze non-verbal behavior in natural settings as individuals interact. 
Our\\nfocus lies in predicting engagement in dyadic interactions by scrutinizing\\nverbal and non-verbal cues, aiming to detect signs of disinterest or confusion.\\nLeveraging such analyses may revolutionize our understanding of human\\ncommunication, foster more effective collaboration in professional\\nenvironments, provide better mental health support through empathetic virtual\\ninteractions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in\\ncasual dyadic conversations, each providing self-reported engagement ratings at\\nthe end of each conversation. We introduce a novel fusion strategy using Large\\nLanguage Models (LLMs) to integrate multiple behavior modalities into a\\n``multimodal transcript'' that can be processed by an LLM for behavioral\\nreasoning tasks. Remarkably, this method achieves performance comparable to\\nestablished fusion techniques even in its preliminary implementation,\\nindicating strong potential for further research and optimization. This fusion\\nmethod is one of the first to approach ``reasoning'' about real-world human\\nbehavior through a language model. Smart glasses provide us the ability to\\nunobtrusively gather high-density multimodal data on human behavior, paving the\\nway for new approaches to understanding and improving human communication with\\nthe potential for important societal benefits. The features and data collected\\nduring the studies will be made publicly available to promote further research.\",\"PeriodicalId\":501541,\"journal\":{\"name\":\"arXiv - CS - Human-Computer Interaction\",\"volume\":\"49 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Human-Computer Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09135\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09135","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K. Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L. Holt, Fernando De la Torre

Over the past decade, wearable computing devices ("smart glasses") have
undergone remarkable advancements in sensor technology, design, and processing
power, ushering in a new era of opportunity for high-density human behavior
data. Equipped with wearable cameras, these glasses offer a unique opportunity
to analyze non-verbal behavior in natural settings as individuals interact. Our
focus lies in predicting engagement in dyadic interactions by scrutinizing
verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
Leveraging such analyses may revolutionize our understanding of human
communication, foster more effective collaboration in professional
environments, provide better mental health support through empathetic virtual
interactions, and enhance accessibility for those with communication barriers.

In this work, we collect a dataset featuring 34 participants engaged in
casual dyadic conversations, each providing self-reported engagement ratings at
the end of each conversation. We introduce a novel fusion strategy using Large
Language Models (LLMs) to integrate multiple behavior modalities into a
"multimodal transcript" that can be processed by an LLM for behavioral
reasoning tasks. Remarkably, this method achieves performance comparable to
established fusion techniques even in its preliminary implementation,
indicating strong potential for further research and optimization. This fusion
method is one of the first to approach "reasoning" about real-world human
behavior through a language model. Smart glasses give us the ability to
unobtrusively gather high-density multimodal data on human behavior, paving the
way for new approaches to understanding and improving human communication with
the potential for important societal benefits. The features and data collected
during the studies will be made publicly available to promote further research.
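
To make the "multimodal transcript" idea more concrete, the sketch below shows one possible way time-aligned behavioral cues from smart-glasses sensors could be serialized into text and handed to an LLM for an engagement rating. This is an illustrative sketch only, not the paper's implementation: the abstract does not specify the feature set, prompt wording, or model, so the window features, the prompt, and the `llm` callable here are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical per-window behavioral cues; the actual gaze/facial/prosodic
# features used in the paper are not given in the abstract.
@dataclass
class BehaviorWindow:
    t_start: float          # window start time (seconds)
    t_end: float            # window end time (seconds)
    speech: str             # ASR transcript for the window ("" if silent)
    gaze_at_partner: float  # fraction of window spent looking at the partner
    smile_intensity: float  # 0..1 facial-expression estimate
    head_nods: int          # count of detected nods

def to_multimodal_transcript(windows: List[BehaviorWindow]) -> str:
    """Serialize time-aligned multimodal cues into a plain-text 'transcript'
    that a language model can read alongside the spoken words."""
    lines = []
    for w in windows:
        cues = (f"gaze_at_partner={w.gaze_at_partner:.2f}, "
                f"smile={w.smile_intensity:.2f}, nods={w.head_nods}")
        speech = w.speech if w.speech else "(no speech)"
        lines.append(f"[{w.t_start:.1f}-{w.t_end:.1f}s] "
                     f"SPEECH: {speech} | NONVERBAL: {cues}")
    return "\n".join(lines)

def predict_engagement(windows: List[BehaviorWindow],
                       llm: Callable[[str], str]) -> str:
    """Prompt an LLM (any text-in/text-out callable) to reason about the
    serialized behavior and output an engagement rating."""
    prompt = (
        "Below is a multimodal transcript of one participant in a dyadic "
        "conversation, combining speech with non-verbal cues.\n\n"
        f"{to_multimodal_transcript(windows)}\n\n"
        "On a scale of 1 (disengaged) to 7 (highly engaged), how engaged "
        "does this participant appear? Briefly justify your rating."
    )
    return llm(prompt)

if __name__ == "__main__":
    demo = [
        BehaviorWindow(0.0, 5.0, "Yeah, that sounds great!", 0.85, 0.70, 2),
        BehaviorWindow(5.0, 10.0, "", 0.20, 0.05, 0),
    ]
    # Stand-in LLM so the sketch runs without external services; in practice
    # this would be a call to an actual language model API.
    echo_llm = lambda prompt: "Rating: 5 (engaged early, attention drops later)."
    print(predict_engagement(demo, echo_llm))
```

One appeal of this kind of text-level fusion is that cross-modal reasoning is delegated to the language model, so additional modalities could in principle be incorporated by extending the serialization rather than retraining a dedicated fusion network.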