Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Estimating the momentary level of a participant's engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e., training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate'24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions among three to four people. In this way, MultiMediate'24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate'24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.
{"title":"MultiMediate'24: Multi-Domain Engagement Estimation","authors":"Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling","doi":"arxiv-2408.16625","DOIUrl":"https://doi.org/arxiv-2408.16625","url":null,"abstract":"Estimating the momentary level of participant's engagement is an important\u0000prerequisite for assistive systems that support human interactions. Previous\u0000work has addressed this task in within-domain evaluation scenarios, i.e.\u0000training and testing on the same dataset. This is in contrast to real-life\u0000scenarios where domain shifts between training and testing data frequently\u0000occur. With MultiMediate'24, we present the first challenge addressing\u0000multi-domain engagement estimation. As training data, we utilise the NOXI\u0000database of dyadic novice-expert interactions. In addition to within-domain\u0000test data, we add two new test domains. First, we introduce recordings\u0000following the NOXI protocol but covering languages that are not present in the\u0000NOXI training data. Second, we collected novel engagement annotations on the\u0000MPIIGroupInteraction dataset which consists of group discussions between three\u0000to four people. In this way, MultiMediate'24 evaluates the ability of\u0000approaches to generalise across factors such as language and cultural\u0000background, group size, task, and screen-mediated vs. face-to-face interaction.\u0000This paper describes the MultiMediate'24 challenge and presents baseline\u0000results. In addition, we discuss selected challenge solutions.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe images in which certain parts are obscured or edited, unlike humans, who excel in such cases. These weaknesses, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks and counterfactually explainable. Our approach includes two variants leveraging either the total effect or the natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
{"title":"See or Guess: Counterfactually Regularized Image Captioning","authors":"Qian Cao, Xu Chen, Ruihua Song, Xiting Wang, Xinting Huang, Yuchen Ren","doi":"arxiv-2408.16809","DOIUrl":"https://doi.org/arxiv-2408.16809","url":null,"abstract":"Image captioning, which generates natural language descriptions of the visual\u0000information in an image, is a crucial task in vision-language research.\u0000Previous models have typically addressed this task by aligning the generative\u0000capabilities of machines with human intelligence through statistical fitting of\u0000existing datasets. While effective for normal images, they may struggle to\u0000accurately describe those where certain parts of the image are obscured or\u0000edited, unlike humans who excel in such cases. These weaknesses they exhibit,\u0000including hallucinations and limited interpretability, often hinder performance\u0000in scenarios with shifted association patterns. In this paper, we present a\u0000generic image captioning framework that employs causal inference to make\u0000existing models more capable of interventional tasks, and counterfactually\u0000explainable. Our approach includes two variants leveraging either total effect\u0000or natural direct effect. Integrating them into the training process enables\u0000models to handle counterfactual scenarios, increasing their generalizability.\u0000Extensive experiments on various datasets show that our method effectively\u0000reduces hallucinations and improves the model's faithfulness to images,\u0000demonstrating high portability across both small-scale and large-scale\u0000image-to-text models. The code is available at\u0000https://github.com/Aman-4-Real/See-or-Guess.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nasim Jamshidi Avanaki, Abhijay Ghildiyal, Nabajeet Barman, Saman Zadtootaghaj
No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due to the diversity of distortions and the lack of large annotated datasets. Many studies have attempted to tackle these challenges by developing more accurate NR-IQA models, often employing complex and computationally expensive networks, or by bridging the domain gap between various distortions to enhance performance on test datasets. In our work, we improve a generic lightweight NR-IQA model by introducing a novel augmentation strategy that boosts its performance by almost 28%. This augmentation strategy enables the network to better discriminate between different distortions in various parts of the image by zooming in and out. Additionally, the inclusion of test-time augmentation further enhances performance, making our lightweight network's results comparable to the current state-of-the-art models, simply through the use of augmentations.
{"title":"MSLIQA: Enhancing Learning Representations for Image Quality Assessment through Multi-Scale Learning","authors":"Nasim Jamshidi Avanaki, Abhijay Ghildiyal, Nabajeet Barman, Saman Zadtootaghaj","doi":"arxiv-2408.16879","DOIUrl":"https://doi.org/arxiv-2408.16879","url":null,"abstract":"No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due\u0000to the diversity of distortions and the lack of large annotated datasets. Many\u0000studies have attempted to tackle these challenges by developing more accurate\u0000NR-IQA models, often employing complex and computationally expensive networks,\u0000or by bridging the domain gap between various distortions to enhance\u0000performance on test datasets. In our work, we improve the performance of a\u0000generic lightweight NR-IQA model by introducing a novel augmentation strategy\u0000that boosts its performance by almost 28%. This augmentation strategy enables\u0000the network to better discriminate between different distortions in various\u0000parts of the image by zooming in and out. Additionally, the inclusion of\u0000test-time augmentation further enhances performance, making our lightweight\u0000network's results comparable to the current state-of-the-art models, simply\u0000through the use of augmentations.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seonghoon Yu, Ilchae Jung, Byeongju Han, Taeoh Kim, Yunho Kim, Dongyoon Wee, Jeany Son
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, the dual-encoders commonly adopted in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with the pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that let the two encoders interact, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions between the two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performance on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
{"title":"A Simple Baseline with Single-encoder for Referring Image Segmentation","authors":"Seonghoon Yu, Ilchae Jung, Byeongju Han, Taeoh Kim, Yunho Kim, Dongyoon Wee, Jeany Son","doi":"arxiv-2408.15521","DOIUrl":"https://doi.org/arxiv-2408.15521","url":null,"abstract":"Referring image segmentation (RIS) requires dense vision-language\u0000interactions between visual pixels and textual words to segment objects based\u0000on a given description. However, commonly adapted dual-encoders in RIS, e.g.,\u0000Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal\u0000dual-encoder), lack dense multi-modal interactions during pre-training, leading\u0000to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods\u0000often rely on multi-modal fusion modules that interact two encoders, but this\u0000approach leads to high computational costs. In this paper, we present a novel\u0000RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of\u0000shared self-attention across all framework components. This enables seamless\u0000interactions of two modalities from input to final prediction, producing\u0000granularly aligned multi-modal features. Furthermore, we propose lightweight\u0000yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which\u0000contribute to the high efficiency of our model. Our simple baseline with a\u0000single encoder achieves outstanding performances on the RIS benchmark datasets\u0000while maintaining computational efficiency, compared to the most recent SoTA\u0000methods based on dual-encoders.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as proprietary models.
{"title":"Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input","authors":"Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu","doi":"arxiv-2408.15542","DOIUrl":"https://doi.org/arxiv-2408.15542","url":null,"abstract":"Rapid advancements have been made in extending Large Language Models (LLMs)\u0000to Large Multi-modal Models (LMMs). However, extending input modality of LLMs\u0000to video data remains a challenging endeavor, especially for long videos. Due\u0000to insufficient access to large-scale high-quality video data and the excessive\u0000compression of visual features, current methods exhibit limitations in\u0000effectively processing long videos. In this paper, we introduce Kangaroo, a\u0000powerful Video LMM aimed at addressing these challenges. Confronted with issue\u0000of inadequate training data, we develop a data curation system to build a\u0000large-scale dataset with high-quality annotations for vision-language\u0000pre-training and instruction tuning. In addition, we design a curriculum\u0000training pipeline with gradually increasing resolution and number of input\u0000frames to accommodate long videos. Evaluation results demonstrate that, with 8B\u0000parameters, Kangaroo achieves state-of-the-art performance across a variety of\u0000video understanding benchmarks while exhibiting competitive results on others.\u0000Particularly, on benchmarks specialized for long videos, Kangaroo excels some\u0000larger models with over 10B parameters and proprietary models.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with a target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages, with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representations. The second stage further optimizes the text embedding by incorporating the extracted hand gesture representation to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on an existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
{"title":"Hand1000: Generating Realistic Hands from Text with Only 1,000 Images","authors":"Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao","doi":"arxiv-2408.15461","DOIUrl":"https://doi.org/arxiv-2408.15461","url":null,"abstract":"Text-to-image generation models have achieved remarkable advancements in\u0000recent years, aiming to produce realistic images from textual descriptions.\u0000However, these models often struggle with generating anatomically accurate\u0000representations of human hands. The resulting images frequently exhibit issues\u0000such as incorrect numbers of fingers, unnatural twisting or interlacing of\u0000fingers, or blurred and indistinct hands. These issues stem from the inherent\u0000complexity of hand structures and the difficulty in aligning textual\u0000descriptions with precise visual depictions of hands. To address these\u0000challenges, we propose a novel approach named Hand1000 that enables the\u0000generation of realistic hand images with target gesture using only 1,000\u0000training samples. The training of Hand1000 is divided into three stages with\u0000the first stage aiming to enhance the model's understanding of hand anatomy by\u0000using a pre-trained hand gesture recognition model to extract gesture\u0000representation. The second stage further optimizes text embedding by\u0000incorporating the extracted hand gesture representation, to improve alignment\u0000between the textual descriptions and the generated hand images. The third stage\u0000utilizes the optimized embedding to fine-tune the Stable Diffusion model to\u0000generate realistic hand images. In addition, we construct the first publicly\u0000available dataset specifically designed for text-to-hand image generation.\u0000Based on the existing hand gesture recognition dataset, we adopt advanced image\u0000captioning models and LLaMA3 to generate high-quality textual descriptions\u0000enriched with detailed gesture information. Extensive experiments demonstrate\u0000that Hand1000 significantly outperforms existing models in producing\u0000anatomically correct hand images while faithfully representing other details in\u0000the text, such as faces, clothing, and colors.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall prediction. The implementation is available at: https://github.com/nestor-sun/sec2sec.
{"title":"Sec2Sec Co-attention for Video-Based Apparent Affective Prediction","authors":"Mingwei Sun, Kunpeng Zhang","doi":"arxiv-2408.15209","DOIUrl":"https://doi.org/arxiv-2408.15209","url":null,"abstract":"Video-based apparent affect detection plays a crucial role in video\u0000understanding, as it encompasses various elements such as vision, audio,\u0000audio-visual interactions, and spatiotemporal information, which are essential\u0000for accurate video predictions. However, existing approaches often focus on\u0000extracting only a subset of these elements, resulting in the limited predictive\u0000capacity of their models. To address this limitation, we propose a novel\u0000LSTM-based network augmented with a Transformer co-attention mechanism for\u0000predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec\u0000Co-attention Transformer surpasses multiple state-of-the-art methods in\u0000predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First\u0000Impressions. Notably, our model offers interpretability, allowing us to examine\u0000the contributions of different time points to the overall prediction. The\u0000implementation is available at: https://github.com/nestor-sun/sec2sec.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline's capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.
{"title":"SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding","authors":"Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan","doi":"arxiv-2408.14764","DOIUrl":"https://doi.org/arxiv-2408.14764","url":null,"abstract":"This paper introduces SynthDoc, a novel synthetic document generation\u0000pipeline designed to enhance Visual Document Understanding (VDU) by generating\u0000high-quality, diverse datasets that include text, images, tables, and charts.\u0000Addressing the challenges of data acquisition and the limitations of existing\u0000datasets, SynthDoc leverages publicly available corpora and advanced rendering\u0000tools to create a comprehensive and versatile dataset. Our experiments,\u0000conducted using the Donut model, demonstrate that models trained with\u0000SynthDoc's data achieve superior performance in pre-training read tasks and\u0000maintain robustness in downstream tasks, despite language inconsistencies. The\u0000release of a benchmark dataset comprising 5,000 image-text pairs not only\u0000showcases the pipeline's capabilities but also provides a valuable resource for\u0000the VDU community to advance research and development in document image\u0000recognition. This work significantly contributes to the field by offering a\u0000scalable solution to data scarcity and by validating the efficacy of end-to-end\u0000models in parsing complex, real-world documents.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images, and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp cropping, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at https://github.com/aimagelab/Alfie.
{"title":"Alfie: Democratising RGBA Image Generation With No $$$","authors":"Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara","doi":"arxiv-2408.14826","DOIUrl":"https://doi.org/arxiv-2408.14826","url":null,"abstract":"Designs and artworks are ubiquitous across various creative fields, requiring\u0000graphic design skills and dedicated software to create compositions that\u0000include many graphical elements, such as logos, icons, symbols, and art scenes,\u0000which are integral to visual storytelling. Automating the generation of such\u0000visual elements improves graphic designers' productivity, democratizes and\u0000innovates the creative industry, and helps generate more realistic synthetic\u0000data for related tasks. These illustration elements are mostly RGBA images with\u0000irregular shapes and cutouts, facilitating blending and scene composition.\u0000However, most image generation models are incapable of generating such images\u0000and achieving this capability requires expensive computational resources,\u0000specific training recipes, or post-processing solutions. In this work, we\u0000propose a fully-automated approach for obtaining RGBA illustrations by\u0000modifying the inference-time behavior of a pre-trained Diffusion Transformer\u0000model, exploiting the prompt-guided controllability and visual quality offered\u0000by such models with no additional computational cost. We force the generation\u0000of entire subjects without sharp croppings, whose background is easily removed\u0000for seamless integration into design projects or artistic scenes. We show with\u0000a user study that, in most cases, users prefer our solution over generating and\u0000then matting an image, and we show that our generated illustrations yield good\u0000results when used as inputs for composite scene generation pipelines. We\u0000release the code at https://github.com/aimagelab/Alfie.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rise of Extended Reality (XR) requires efficient streaming of 3D online worlds, challenging current 3D Gaussian Splatting (3DGS) representations to adapt to bandwidth-constrained environments. This paper proposes LapisGS, a layered 3DGS that supports adaptive streaming and progressive rendering. Our method constructs a layered structure for cumulative representation, incorporates dynamic opacity optimization to maintain visual fidelity, and utilizes occupancy maps to efficiently manage Gaussian splats. The proposed model offers a progressive representation supporting a continuous rendering quality adapted for bandwidth-aware streaming. Extensive experiments validate the effectiveness of our approach in balancing visual fidelity with the compactness of the model, with up to 50.71% improvement in SSIM, 286.53% improvement in LPIPS, and 318.41% reduction in model size, and show its potential for bandwidth-adapted 3D streaming and rendering applications.
{"title":"LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming","authors":"Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi","doi":"arxiv-2408.14823","DOIUrl":"https://doi.org/arxiv-2408.14823","url":null,"abstract":"The rise of Extended Reality (XR) requires efficient streaming of 3D online\u0000worlds, challenging current 3DGS representations to adapt to\u0000bandwidth-constrained environments. This paper proposes LapisGS, a layered 3DGS\u0000that supports adaptive streaming and progressive rendering. Our method\u0000constructs a layered structure for cumulative representation, incorporates\u0000dynamic opacity optimization to maintain visual fidelity, and utilizes\u0000occupancy maps to efficiently manage Gaussian splats. This proposed model\u0000offers a progressive representation supporting a continuous rendering quality\u0000adapted for bandwidth-aware streaming. Extensive experiments validate the\u0000effectiveness of our approach in balancing visual fidelity with the compactness\u0000of the model, with up to 50.71% improvement in SSIM, 286.53% improvement in\u0000LPIPS, and 318.41% reduction in model size, and shows its potential for\u0000bandwidth-adapted 3D streaming and rendering applications.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}