Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang
We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method, aimed at accurately and efficiently creating a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices and pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capturing in various environments. Considering that IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method that harmonizes all sensors and utilizes environment cues, yielding promising results for long-term capture in large scenes. To promote research on egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environments. A variety of scenarios, such as a basketball gym and a commercial street, alongside challenging human motions, such as daily greetings, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and generalization ability of HiSC4D. The dataset and code will be made publicly available at www.lidarhumanmotion.net/hisc4d for research purposes.
{"title":"HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR","authors":"Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang","doi":"arxiv-2409.04398","DOIUrl":"https://doi.org/arxiv-2409.04398","url":null,"abstract":"We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture\u0000method, aimed at accurately and efficiently creating a dynamic digital world,\u0000containing large-scale indoor-outdoor scenes, diverse human motions, rich\u0000human-human interactions, and human-environment interactions. By utilizing\u0000body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human\u0000motions in unconstrained space without the need for external devices and\u0000pre-built maps. This affords great flexibility and accessibility for\u0000human-centered interaction and 4D scene capturing in various environments.\u0000Taking into account that IMUs can capture human spatially unrestricted poses\u0000but are prone to drifting for long-period using, and while LiDAR is stable for\u0000global localization but rough for local positions and orientations, HiSC4D\u0000employs a joint optimization method, harmonizing all sensors and utilizing\u0000environment cues, yielding promising results for long-term capture in large\u0000scenes. To promote research of egocentric human interaction in large scenes and\u0000facilitate downstream tasks, we also present a dataset, containing 8 sequences\u0000in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D\u0000human motions with SMPL annotations and dynamic scenes, 31k frames of cropped\u0000human point clouds, and scene mesh of the environment. A variety of scenarios,\u0000such as the basketball gym and commercial street, alongside challenging human\u0000motions, such as daily greeting, one-on-one basketball playing, and tour\u0000guiding, demonstrate the effectiveness and the generalization ability of\u0000HiSC4D. The dataset and code will be publicated on\u0000www.lidarhumanmotion.net/hisc4d available for research purposes.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Large Language Models (MLLMs) have shown excellent performance in question answering on single-event videos. In this paper, we present question answering over dense video events, a novel task that requires answering and grounding dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. We then benchmark existing MLLMs and show that models excelling at single-event QA struggle to perform well on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach built around a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module, which respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1 percent and 3.7 percent in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively.
{"title":"Question-Answering Dense Video Events","authors":"Hangyu Qin, Junbin Xiao, Angela Yao","doi":"arxiv-2409.04388","DOIUrl":"https://doi.org/arxiv-2409.04388","url":null,"abstract":"Multimodal Large Language Models (MLLMs) have shown excellent performance in\u0000question-answering of single-event videos. In this paper, we present\u0000question-answering dense video events, a novel task that requires answering and\u0000grounding the dense-event questions in long videos, thus challenging MLLMs to\u0000faithfully comprehend and reason about multiple events occurring over extended\u0000time periods. To facilitate the study, we construct DeVE-QA - a dataset\u0000featuring 78K questions about 26K events on 10.6K long videos. We then\u0000benchmark and show that existing MLLMs excelling at single-event QA struggle to\u0000perform well in DeVE-QA. For improvement, we propose DeVi, a novel\u0000training-free MLLM approach that highlights a hierarchical captioning module, a\u0000temporal event memory module, and a self-consistency checking module to\u0000respectively detect, contextualize and memorize, and ground dense-events in\u0000long videos for question answering. Extensive experiments show that DeVi is\u0000superior at answering dense-event questions and grounding relevant video\u0000moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1\u0000percent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA\u0000respectively.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"392 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione
Nowadays, deep neural networks are used to solve complex tasks in several critical applications, and protecting both their integrity and intellectual property rights (IPR) has become of utmost importance. To this end, we advance WaterMAS, a substitutive, white-box neural network watermarking method that improves the trade-off among robustness, imperceptibility, and computational complexity, while making provisions for increased data payload and security. WaterMAS insertion keeps the watermarked weights unchanged while sharpening their underlying gradient space. Robustness is thus ensured by limiting the attack's strength: even small alterations of the watermarked weights would impact the model's performance. Imperceptibility is ensured by inserting the watermark during the training process. The relationship among the WaterMAS data payload, imperceptibility, and robustness properties is discussed. The secret key is represented by the positions of the weights conveying the watermark, randomly chosen across multiple layers of the model. Security is evaluated by investigating the case in which an attacker intercepts the key. The experimental validation considers 5 models and 2 tasks (VGG16, ResNet18, MobileNetV3, and SwinT for CIFAR10 image classification, and DeepLabV3 for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian noise addition, pruning, fine-tuning, and quantization). The code will be released open-source upon acceptance of the article.
{"title":"WaterMAS: Sharpness-Aware Maximization for Neural Network Watermarking","authors":"Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione","doi":"arxiv-2409.03902","DOIUrl":"https://doi.org/arxiv-2409.03902","url":null,"abstract":"Nowadays, deep neural networks are used for solving complex tasks in several\u0000critical applications and protecting both their integrity and intellectual\u0000property rights (IPR) has become of utmost importance. To this end, we advance\u0000WaterMAS, a substitutive, white-box neural network watermarking method that\u0000improves the trade-off among robustness, imperceptibility, and computational\u0000complexity, while making provisions for increased data payload and security.\u0000WasterMAS insertion keeps unchanged the watermarked weights while sharpening\u0000their underlying gradient space. The robustness is thus ensured by limiting the\u0000attack's strength: even small alterations of the watermarked weights would\u0000impact the model's performance. The imperceptibility is ensured by inserting\u0000the watermark during the training process. The relationship among the WaterMAS\u0000data payload, imperceptibility, and robustness properties is discussed. The\u0000secret key is represented by the positions of the weights conveying the\u0000watermark, randomly chosen through multiple layers of the model. The security\u0000is evaluated by investigating the case in which an attacker would intercept the\u0000key. The experimental validations consider 5 models and 2 tasks (VGG16,\u0000ResNet18, MobileNetV3, SwinT for CIFAR10 image classification, and DeepLabV3\u0000for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian\u0000noise addition, pruning, fine-tuning, and quantization). The code will be\u0000released open-source upon acceptance of the article.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu
Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called SegTalker that decouples lip movements from image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. Then we disentangle the semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames. In this way, most textures are fully preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g., hair, lips, eyebrows), our approach enables seamless facial editing when generating talking face videos. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
{"title":"SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing","authors":"Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu","doi":"arxiv-2409.03605","DOIUrl":"https://doi.org/arxiv-2409.03605","url":null,"abstract":"Audio-driven talking face generation aims to synthesize video with lip\u0000movements synchronized to input audio. However, current generative techniques\u0000face challenges in preserving intricate regional textures (skin, teeth). To\u0000address the aforementioned challenges, we propose a novel framework called\u0000SegTalker to decouple lip movements and image textures by introducing\u0000segmentation as intermediate representation. Specifically, given the mask of\u0000image employed by a parsing network, we first leverage the speech to drive the\u0000mask and generate talking segmentation. Then we disentangle semantic regions of\u0000image into style codes using a mask-guided encoder. Ultimately, we inject the\u0000previously generated talking segmentation and style codes into a mask-guided\u0000StyleGAN to synthesize video frame. In this way, most of textures are fully\u0000preserved. Moreover, our approach can inherently achieve background separation\u0000and facilitate mask-guided facial local editing. In particular, by editing the\u0000mask and swapping the region textures from a given reference image (e.g. hair,\u0000lip, eyebrows), our approach enables facial editing seamlessly when generating\u0000talking face video. Experiments demonstrate that our proposed approach can\u0000effectively preserve texture details and generate temporally consistent video\u0000while remaining competitive in lip synchronization. Quantitative and\u0000qualitative results on the HDTF and MEAD datasets illustrate the superior\u0000performance of our method over existing methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
One common belief is that, with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than state-of-the-art (SOTA) transformer-based methods.
{"title":"Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression","authors":"Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin","doi":"arxiv-2409.03385","DOIUrl":"https://doi.org/arxiv-2409.03385","url":null,"abstract":"One common belief is that with complex models and pre-training on large-scale\u0000datasets, transformer-based methods for referring expression comprehension\u0000(REC) perform much better than existing graph-based methods. We observe that\u0000since most graph-based methods adopt an off-the-shelf detector to locate\u0000candidate objects (i.e., regions detected by the object detector), they face\u0000two challenges that result in subpar performance: (1) the presence of\u0000significant noise caused by numerous irrelevant objects during reasoning, and\u0000(2) inaccurate localization outcomes attributed to the provided detector. To\u0000address these issues, we introduce a plug-and-adapt module guided by\u0000sub-expressions, called dynamic gate constraint (DGC), which can adaptively\u0000disable irrelevant proposals and their connections in graphs during reasoning.\u0000We further introduce an expression-guided regression strategy (EGR) to refine\u0000location prediction. Extensive experimental results on the RefCOCO, RefCOCO+,\u0000RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the\u0000effectiveness of the DGC module and the EGR strategy in consistently boosting\u0000the performances of various graph-based REC methods. Without any pretaining,\u0000the proposed graph-based method achieves better performance than the\u0000state-of-the-art (SOTA) transformer-based methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. In particular, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
{"title":"LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture","authors":"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang","doi":"arxiv-2409.02889","DOIUrl":"https://doi.org/arxiv-2409.02889","url":null,"abstract":"Expanding the long-context capabilities of Multi-modal Large Language\u0000Models~(MLLMs) is crucial for video understanding, high-resolution image\u0000understanding, and multi-modal agents. This involves a series of systematic\u0000optimizations, including model architecture, data construction and training\u0000strategy, particularly addressing challenges such as textit{degraded\u0000performance with more images} and textit{high computational costs}. In this\u0000paper, we adapt the model architecture to a hybrid of Mamba and Transformer\u0000blocks, approach data construction with both temporal and spatial dependencies\u0000among multiple images and employ a progressive training strategy. The released\u0000model textbf{LongLLaVA}~(textbf{Long}-Context textbf{L}arge\u0000textbf{L}anguage textbf{a}nd textbf{V}ision textbf{A}ssistant) is the first\u0000hybrid MLLM, which achieved a better balance between efficiency and\u0000effectiveness. LongLLaVA not only achieves competitive results across various\u0000benchmarks, but also maintains high throughput and low memory consumption.\u0000Especially, it could process nearly a thousand images on a single A100 80GB\u0000GPU, showing promising application prospects for a wide range of tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.
{"title":"ExpLLM: Towards Chain of Thought for Facial Expression Recognition","authors":"Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua","doi":"arxiv-2409.02828","DOIUrl":"https://doi.org/arxiv-2409.02828","url":null,"abstract":"Facial expression recognition (FER) is a critical task in multimedia with\u0000significant implications across various domains. However, analyzing the causes\u0000of facial expressions is essential for accurately recognizing them. Current\u0000approaches, such as those based on facial action units (AUs), typically provide\u0000AU names and intensities but lack insight into the interactions and\u0000relationships between AUs and the overall expression. In this paper, we propose\u0000a novel method called ExpLLM, which leverages large language models to generate\u0000an accurate chain of thought (CoT) for facial expression recognition.\u0000Specifically, we have designed the CoT mechanism from three key perspectives:\u0000key observations, overall emotional interpretation, and conclusion. The key\u0000observations describe the AU's name, intensity, and associated emotions. The\u0000overall emotional interpretation provides an analysis based on multiple AUs and\u0000their interactions, identifying the dominant emotions and their relationships.\u0000Finally, the conclusion presents the final expression label derived from the\u0000preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed\u0000to construct this expression CoT and generate instruction-description data for\u0000training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets\u0000demonstrate that ExpLLM outperforms current state-of-the-art FER methods.\u0000ExpLLM also surpasses the latest GPT-4o in expression CoT generation,\u0000particularly in recognizing micro-expressions where GPT-4o frequently fails.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses, and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism than text-only or audio-only conditioning, and our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
{"title":"PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation","authors":"Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song","doi":"arxiv-2409.02657","DOIUrl":"https://doi.org/arxiv-2409.02657","url":null,"abstract":"While previous audio-driven talking head generation (THG) methods generate\u0000head poses from driving audio, the generated poses or lips cannot match the\u0000audio well or are not editable. In this study, we propose textbf{PoseTalk}, a\u0000THG system that can freely generate lip-synchronized talking head videos with\u0000free head poses conditioned on text prompts and audio. The core insight of our\u0000method is using head pose to connect visual, linguistic, and audio signals.\u0000First, we propose to generate poses from both audio and text prompts, where the\u0000audio offers short-term variations and rhythm correspondence of the head\u0000movements and the text prompts describe the long-term semantics of head\u0000motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to\u0000generate motion latent from text prompts and audio cues in a pose latent space.\u0000Second, we observe a loss-imbalance problem: the loss for the lip region\u0000contributes less than 4% of the total reconstruction loss caused by both pose\u0000and lip, making optimization lean towards head movements rather than lip\u0000shapes. To address this issue, we propose a refinement-based learning strategy\u0000to synthesize natural talking videos using two cascaded networks, i.e.,\u0000CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce\u0000animated images in novel poses and the RefineNet focuses on learning finer lip\u0000motions by progressively estimating lip motions from low-to-high resolutions,\u0000yielding improved lip-synchronization performance. Experiments demonstrate our\u0000pose prediction strategy achieves better pose diversity and realness compared\u0000to text-only or audio-only, and our video generator model outperforms\u0000state-of-the-art methods in synthesizing talking videos with natural head\u0000motions. Project: https://junleen.github.io/projects/posetalk.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognizing objects in low-resolution images is challenging due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to situations where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model that delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student's learning is supervised with a contrastive relational distillation loss, which preserves the similarities of various relational structures in the contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
{"title":"Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation","authors":"Kangkai Zhang, Shiming Ge, Ruixin Shi, Dan Zeng","doi":"arxiv-2409.02555","DOIUrl":"https://doi.org/arxiv-2409.02555","url":null,"abstract":"Recognizing objects in low-resolution images is a challenging task due to the\u0000lack of informative details. Recent studies have shown that knowledge\u0000distillation approaches can effectively transfer knowledge from a\u0000high-resolution teacher model to a low-resolution student model by aligning\u0000cross-resolution representations. However, these approaches still face\u0000limitations in adapting to the situation where the recognized objects exhibit\u0000significant representation discrepancies between training and testing images.\u0000In this study, we propose a cross-resolution relational contrastive\u0000distillation approach to facilitate low-resolution object recognition. Our\u0000approach enables the student model to mimic the behavior of a well-trained\u0000teacher model which delivers high accuracy in identifying high-resolution\u0000objects. To extract sufficient knowledge, the student learning is supervised\u0000with contrastive relational distillation loss, which preserves the similarities\u0000in various relational structures in contrastive representation space. In this\u0000manner, the capability of recovering missing details of familiar low-resolution\u0000objects can be effectively enhanced, leading to a better knowledge transfer.\u0000Extensive experiments on low-resolution object classification and\u0000low-resolution face recognition clearly demonstrate the effectiveness and\u0000adaptability of our approach.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Fu (University of the Arts London, Creative Computing Institute, London, United Kingdom), Shun Fu (Bloks Technology Company, Shanghai, China), Mick Grierson (University of the Arts London, Creative Computing Institute, London, United Kingdom)
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information such as coral species and habitat, enriching the user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
{"title":"Coral Model Generation from Single Images for Virtual Reality Applications","authors":"Jie FuUniversity of the Arts London, Creative Computing Institute, London, United Kingdom, Shun FuBloks Technology Company, Shanghai, China, Mick GriersonUniversity of the Arts London, Creative Computing Institute, London, United Kingdom","doi":"arxiv-2409.02376","DOIUrl":"https://doi.org/arxiv-2409.02376","url":null,"abstract":"With the rapid development of VR technology, the demand for high-quality 3D\u0000models is increasing. Traditional methods struggle with efficiency and quality\u0000in large-scale customization. This paper introduces a deep-learning framework\u0000that generates high-precision 3D coral models from a single image. Using the\u0000Coral dataset, the framework extracts geometric and texture features, performs\u00003D reconstruction, and optimizes design and material blending. Advanced\u0000optimization and polygon count control ensure shape accuracy, detail retention,\u0000and flexible output for various complexities, catering to high-quality\u0000rendering and real-time interaction needs.The project incorporates Explainable\u0000AI (XAI) to transform AI-generated models into interactive \"artworks,\" best\u0000viewed in VR and XR. This enhances model interpretability and human-machine\u0000collaboration. Real-time feedback in VR interactions displays information like\u0000coral species and habitat, enriching user experience. The generated models\u0000surpass traditional methods in detail, visual quality, and efficiency. This\u0000research offers an intelligent approach to 3D content creation for VR, lowering\u0000production barriers, and promoting widespread VR applications. Additionally,\u0000integrating XAI provides new insights into AI-generated visual content and\u0000advances research in 3D vision interpretability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}