
arXiv - CS - Multimedia: Latest Publications

MultiMediate'24: Multi-Domain Engagement Estimation
Pub Date : 2024-08-29 DOI: arxiv-2408.16625
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Estimating the momentary level of participants' engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate'24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions between three to four people. In this way, MultiMediate'24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate'24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.
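A minimal sketch of the multi-domain evaluation idea described in this abstract: score one trained predictor on each test domain separately so that generalisation gaps become visible. The domain names, the `predict` callable, and the use of the concordance correlation coefficient (CCC) as the metric are illustrative assumptions, not the official MultiMediate'24 protocol.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D score arrays."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

def evaluate_across_domains(predict, test_domains):
    """predict: callable mapping features to engagement scores.
    test_domains: dict of domain name -> (features, labels)."""
    return {name: ccc(labels, predict(feats))
            for name, (feats, labels) in test_domains.items()}

# Hypothetical usage:
# scores = evaluate_across_domains(model.predict, {
#     "noxi_within_domain": (X_noxi, y_noxi),
#     "noxi_new_languages": (X_lang, y_lang),
#     "mpii_group_interaction": (X_mpii, y_mpii),
# })
```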
Citations: 0
See or Guess: Counterfactually Regularized Image Captioning
Pub Date : 2024-08-29 DOI: arxiv-2408.16809
Qian Cao, Xu Chen, Ruihua Song, Xiting Wang, Xinting Huang, Yuchen Ren
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe images in which certain parts are obscured or edited, unlike humans, who excel in such cases. The weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
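As a rough illustration of the counterfactual idea (contrasting what the model predicts with and without part of the visual evidence, in the spirit of a total effect), the sketch below compares caption-token logits for a factual image and a counterfactual, partially masked one. The `model(image, tokens)` signature and the masking strategy are assumptions for illustration, not the authors' implementation.

```python
import torch

def total_effect_logits(model, image, masked_image, tokens):
    """Difference between token logits under the factual image and a
    counterfactual (masked) image; a large gap suggests the prediction is
    grounded in the visual evidence rather than in language priors."""
    with torch.no_grad():
        factual = model(image, tokens)           # assumed to return vocab logits
        counterfactual = model(masked_image, tokens)
    return factual - counterfactual
```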
Citations: 0
MSLIQA: Enhancing Learning Representations for Image Quality Assessment through Multi-Scale Learning
Pub Date : 2024-08-29 DOI: arxiv-2408.16879
Nasim Jamshidi Avanaki, Abhijay Ghildiyal, Nabajeet Barman, Saman Zadtootaghaj
No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due to the diversity of distortions and the lack of large annotated datasets. Many studies have attempted to tackle these challenges by developing more accurate NR-IQA models, often employing complex and computationally expensive networks, or by bridging the domain gap between various distortions to enhance performance on test datasets. In our work, we improve the performance of a generic lightweight NR-IQA model by introducing a novel augmentation strategy that boosts its performance by almost 28%. This augmentation strategy enables the network to better discriminate between different distortions in various parts of the image by zooming in and out. Additionally, the inclusion of test-time augmentation further enhances performance, making our lightweight network's results comparable to the current state-of-the-art models, simply through the use of augmentations.
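One plausible reading of the "zooming in and out" augmentation plus test-time augmentation, written as a small torchvision sketch; the scale grid, crop size, and number of test-time views are arbitrary choices, not the values used in the paper.

```python
import torch
import torchvision.transforms as T

# Multi-scale views: crops that zoom into a region, moderate crops, and a
# padded "zoomed-out" view, all brought to a common 224x224 resolution.
multi_scale_augment = T.Compose([
    T.RandomChoice([
        T.RandomResizedCrop(224, scale=(0.2, 0.5)),      # zoom in on a region
        T.RandomResizedCrop(224, scale=(0.5, 1.0)),      # moderate crop
        T.Compose([T.Resize((160, 160)), T.Pad(32)]),    # zoom out with padding
    ]),
    T.ToTensor(),
])

def predict_quality_with_tta(model, image, n_views=8):
    """Average the predicted quality score over several augmented views."""
    views = torch.stack([multi_scale_augment(image) for _ in range(n_views)])
    with torch.no_grad():
        return model(views).mean().item()
```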
Citations: 0
A Simple Baseline with Single-encoder for Referring Image Segmentation
Pub Date : 2024-08-28 DOI: arxiv-2408.15521
Seonghoon Yu, Ilchae Jung, Byeongju Han, Taeoh Kim, Yunho Kim, Dongyoon Wee, Jeany Son
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
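A compact sketch of the single-encoder layout described above: image patch tokens and word tokens share one self-attention stack, and only a light mask head sits on top. The plain `nn.TransformerEncoder` stands in for BEiT-3 and the head for the Shared FPN / Shared Mask Decoder; dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SingleEncoderRIS(nn.Module):
    def __init__(self, dim=256, vocab=30522, layers=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.word_embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.mask_head = nn.Sequential(            # lightweight decoder stand-in
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, 1, 2, stride=2),
        )

    def forward(self, image, token_ids):
        b = image.size(0)
        patches = self.patch_embed(image)                       # B, C, h, w
        h, w = patches.shape[-2:]
        patches = patches.flatten(2).transpose(1, 2)            # B, h*w, C
        words = self.word_embed(token_ids)                      # B, L, C
        fused = self.encoder(torch.cat([patches, words], 1))    # shared self-attention
        vis = fused[:, : h * w].transpose(1, 2).reshape(b, -1, h, w)
        return self.mask_head(vis)                              # B, 1, 4h, 4w mask logits

# Dummy check: SingleEncoderRIS()(torch.randn(1, 3, 224, 224),
#                                 torch.randint(0, 30522, (1, 12))).shape
```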
Citations: 0
Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Pub Date : 2024-08-28 DOI: arxiv-2408.15542
Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. Particularly, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters as well as proprietary models.
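A minimal sketch of the curriculum idea: train in stages with progressively higher resolution and more sampled frames, resuming from the previous stage each time. The stage values and the `train_stage` callable are illustrative assumptions, not Kangaroo's actual training recipe.

```python
# Each stage increases spatial resolution and the number of input frames.
curriculum = [
    {"resolution": 224, "num_frames": 8},
    {"resolution": 336, "num_frames": 16},
    {"resolution": 448, "num_frames": 64},
]

def run_curriculum(train_stage, stages):
    """train_stage(resolution, num_frames) trains one stage and is expected
    to resume from the checkpoint produced by the previous stage."""
    for i, cfg in enumerate(stages, start=1):
        print(f"stage {i}: resolution={cfg['resolution']}, frames={cfg['num_frames']}")
        train_stage(cfg["resolution"], cfg["num_frames"])
```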
Citations: 0
Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Pub Date : 2024-08-28 DOI: arxiv-2408.15461
Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages, with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
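The second stage described above enriches the text embedding with a gesture representation extracted by a pre-trained gesture recognition model. The sketch below shows one simple way such a fusion could look (concatenation followed by a linear projection); the dimensions and the fusion operator are assumptions for illustration, not Hand1000's exact design.

```python
import torch
import torch.nn as nn

class GestureTextFusion(nn.Module):
    """Fuse a per-image gesture embedding into every token of a text embedding."""

    def __init__(self, text_dim=768, gesture_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim + gesture_dim, text_dim)

    def forward(self, text_emb, gesture_emb):
        # text_emb: (B, L, text_dim), gesture_emb: (B, gesture_dim)
        g = gesture_emb.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        return self.proj(torch.cat([text_emb, g], dim=-1))  # (B, L, text_dim)
```

The fused embedding would then condition the diffusion model in the third stage.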
Citations: 0
Sec2Sec Co-attention for Video-Based Apparent Affective Prediction
Pub Date : 2024-08-27 DOI: arxiv-2408.15209
Mingwei Sun, Kunpeng Zhang
Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall prediction. The implementation is available at: https://github.com/nestor-sun/sec2sec.
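A bare-bones version of the co-attention idea: per-second audio and visual feature sequences (for example, LSTM outputs) cross-attend to each other before a pooled fusion predicts the affect score. Feature dimensions, pooling, and the regression head are illustrative assumptions rather than the Sec2Sec architecture itself.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) per-second feature sequences
        a_ctx, _ = self.audio_to_visual(audio, visual, visual)  # audio queries visual
        v_ctx, _ = self.visual_to_audio(visual, audio, audio)   # visual queries audio
        fused = torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
        return self.head(fused)                                 # (B, 1) affect score
```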
Citations: 0
SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
Pub Date : 2024-08-27 DOI: arxiv-2408.14764
Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline's capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.
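At its core, a pipeline like this pairs programmatically rendered pages with their known ground truth. The toy sketch below renders plain text lines onto a blank page with Pillow and returns the image together with its transcription; real pipelines such as SynthDoc also compose tables, charts, and images, and the layout values here are arbitrary.

```python
from PIL import Image, ImageDraw

def render_text_page(lines, size=(800, 1000), margin=40, line_height=28):
    """Render text lines on a white page and return (image, ground-truth text)."""
    page = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(page)
    y = margin
    for line in lines:
        draw.text((margin, y), line, fill="black")  # default bitmap font
        y += line_height
    return page, "\n".join(lines)

# Hypothetical usage:
# image, transcription = render_text_page(["SynthDoc sample page", "Second line"])
```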
Citations: 0
Alfie: Democratising RGBA Image Generation With No $$$
Pub Date : 2024-08-27 DOI: arxiv-2408.14826
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images, and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at https://github.com/aimagelab/Alfie.
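The abstract notes that generated subjects sit on a background that is easy to remove. The sketch below shows one naive post-hoc cutout: pixels close to a flat background colour become transparent. This is only a stand-in for illustration, not Alfie's own method; the background colour and tolerance are assumptions.

```python
import numpy as np
from PIL import Image

def rgba_from_flat_background(image, bg_rgb=(255, 255, 255), tol=30):
    """Convert an RGB image with a near-uniform background into RGBA by
    making background-coloured pixels fully transparent."""
    rgb = np.asarray(image.convert("RGB")).astype(np.int16)
    dist = np.abs(rgb - np.array(bg_rgb)).max(axis=-1)      # per-pixel colour gap
    alpha = np.where(dist > tol, 255, 0).astype(np.uint8)   # keep non-background pixels
    rgba = np.dstack([rgb.astype(np.uint8), alpha])
    return Image.fromarray(rgba, mode="RGBA")
```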
Citations: 0
LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming
Pub Date : 2024-08-27 DOI: arxiv-2408.14823
Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi
The rise of Extended Reality (XR) requires efficient streaming of 3D online worlds, challenging current 3DGS representations to adapt to bandwidth-constrained environments. This paper proposes LapisGS, a layered 3DGS that supports adaptive streaming and progressive rendering. Our method constructs a layered structure for cumulative representation, incorporates dynamic opacity optimization to maintain visual fidelity, and utilizes occupancy maps to efficiently manage Gaussian splats. This proposed model offers a progressive representation supporting a continuous rendering quality adapted for bandwidth-aware streaming. Extensive experiments validate the effectiveness of our approach in balancing visual fidelity with the compactness of the model, with up to 50.71% improvement in SSIM, 286.53% improvement in LPIPS, and 318.41% reduction in model size, and show its potential for bandwidth-adapted 3D streaming and rendering applications.
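A layered, cumulative representation lends itself to a simple rate-adaptation rule: always stream the base layer, then add enhancement layers while the bandwidth budget allows. The sketch below illustrates that selection step only; the layer sizes and budget are made-up numbers, not measurements from the paper.

```python
def select_layers(layer_sizes_mb, budget_mb):
    """layer_sizes_mb[0] is the base layer; later entries are cumulative
    enhancement layers. Returns the indices of layers to stream."""
    chosen, used = [], 0.0
    for level, size in enumerate(layer_sizes_mb):
        if level > 0 and used + size > budget_mb:
            break
        chosen.append(level)
        used += size
    return chosen

# Hypothetical usage: select_layers([4.0, 3.0, 2.5, 2.0], budget_mb=8.0) -> [0, 1]
```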
Citations: 0