In the domain of human-computer interaction, accurately recognizing and interpreting human emotions is crucial yet challenging due to the complexity and subtlety of emotional expressions. This study explores the potential for detecting a rich and flexible range of emotions through a multimodal approach that integrates facial expressions, voice tones, and transcripts from video clips. We propose a novel framework that maps a variety of emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space, which reflects the fluctuations and positivity/negativity of emotions and enables a more varied and comprehensive representation of emotional states. We employed K-means clustering to transition emotions from traditional discrete categorization to a continuous labeling system and built an emotion recognition classifier on top of this system. The effectiveness of the proposed model is evaluated on the MER2024 dataset, which contains culturally consistent video clips from Chinese movies and TV series, annotated with both discrete and open-vocabulary emotion labels. Our experiments successfully achieved the transformation between discrete and continuous models, and the proposed model generated a more diverse and comprehensive emotion vocabulary while maintaining strong accuracy.
{"title":"Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection","authors":"Jiehui Jia, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07901","DOIUrl":"https://doi.org/arxiv-2409.07901","url":null,"abstract":"In the domain of human-computer interaction, accurately recognizing and\u0000interpreting human emotions is crucial yet challenging due to the complexity\u0000and subtlety of emotional expressions. This study explores the potential for\u0000detecting a rich and flexible range of emotions through a multimodal approach\u0000which integrates facial expressions, voice tones, and transcript from video\u0000clips. We propose a novel framework that maps variety of emotions in a\u0000three-dimensional Valence-Arousal-Dominance (VAD) space, which could reflect\u0000the fluctuations and positivity/negativity of emotions to enable a more variety\u0000and comprehensive representation of emotional states. We employed K-means\u0000clustering to transit emotions from traditional discrete categorization to a\u0000continuous labeling system and built a classifier for emotion recognition upon\u0000this system. The effectiveness of the proposed model is evaluated using the\u0000MER2024 dataset, which contains culturally consistent video clips from Chinese\u0000movies and TV series, annotated with both discrete and open-vocabulary emotion\u0000labels. Our experiment successfully achieved the transformation between\u0000discrete and continuous models, and the proposed model generated a more diverse\u0000and comprehensive set of emotion vocabulary while maintaining strong accuracy.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha blending characteristic of the splatting process for single-step optimization. By incorporating a background bias in our objective function, our method shows superior robustness in 3D segmentation against noise. Remarkably, our optimization completes within 30 seconds, about 50× faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at https://github.com/florinshen/FlashSplat.
{"title":"FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally","authors":"Qiuhong Shen, Xingyi Yang, Xinchao Wang","doi":"arxiv-2409.08270","DOIUrl":"https://doi.org/arxiv-2409.08270","url":null,"abstract":"This study addresses the challenge of accurately segmenting 3D Gaussian\u0000Splatting from 2D masks. Conventional methods often rely on iterative gradient\u0000descent to assign each Gaussian a unique label, leading to lengthy optimization\u0000and sub-optimal solutions. Instead, we propose a straightforward yet globally\u0000optimal solver for 3D-GS segmentation. The core insight of our method is that,\u0000with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially\u0000a linear function with respect to the labels of each Gaussian. As such, the\u0000optimal label assignment can be solved via linear programming in closed form.\u0000This solution capitalizes on the alpha blending characteristic of the splatting\u0000process for single step optimization. By incorporating the background bias in\u0000our objective function, our method shows superior robustness in 3D segmentation\u0000against noises. Remarkably, our optimization completes within 30 seconds, about\u000050$times$ faster than the best existing methods. Extensive experiments\u0000demonstrate the efficiency and robustness of our method in segmenting various\u0000scenes, and its superior performance in downstream tasks such as object removal\u0000and inpainting. Demos and code will be available at\u0000https://github.com/florinshen/FlashSplat.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant attention in computer vision and computer graphics due to its high rendering speed and remarkable quality. While extant research has endeavored to extend the application of 3DGS from static to dynamic scenes, such efforts have been consistently impeded by excessive model sizes, constraints on video duration, and content deviation. These limitations significantly compromise the streamability of dynamic 3D Gaussian models, thereby restricting their utility in downstream applications, including volumetric video, autonomous vehicles, and immersive technologies such as virtual, augmented, and mixed reality. This paper introduces SwinGS, a novel framework for training, delivering, and rendering volumetric video in a real-time streaming fashion. To address the aforementioned challenges and enhance streamability, SwinGS integrates spacetime Gaussians with Markov Chain Monte Carlo (MCMC) to adapt the model to various 3D scenes across frames, while a sliding window captures Gaussian snapshots for each frame in an accumulative manner. We implement a prototype of SwinGS and demonstrate its streamability across various datasets and scenes. Additionally, we develop an interactive WebGL viewer enabling real-time volumetric video playback on most devices with modern browsers, including smartphones and tablets. Experimental results show that SwinGS reduces transmission costs by 83.6% compared to previous work with a negligible compromise in PSNR. Moreover, SwinGS easily scales to long video sequences without compromising quality.
{"title":"SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming with Arbitrary Length","authors":"Bangya Liu, Suman Banerjee","doi":"arxiv-2409.07759","DOIUrl":"https://doi.org/arxiv-2409.07759","url":null,"abstract":"Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant\u0000attention in computer vision and computer graphics due to its high rendering\u0000speed and remarkable quality. While extant research has endeavored to extend\u0000the application of 3DGS from static to dynamic scenes, such efforts have been\u0000consistently impeded by excessive model sizes, constraints on video duration,\u0000and content deviation. These limitations significantly compromise the\u0000streamability of dynamic 3D Gaussian models, thereby restricting their utility\u0000in downstream applications, including volumetric video, autonomous vehicle, and\u0000immersive technologies such as virtual, augmented, and mixed reality. This paper introduces SwinGS, a novel framework for training, delivering, and\u0000rendering volumetric video in a real-time streaming fashion. To address the\u0000aforementioned challenges and enhance streamability, SwinGS integrates\u0000spacetime Gaussian with Markov Chain Monte Carlo (MCMC) to adapt the model to\u0000fit various 3D scenes across frames, in the meantime employing a sliding window\u0000captures Gaussian snapshots for each frame in an accumulative way. We implement\u0000a prototype of SwinGS and demonstrate its streamability across various datasets\u0000and scenes. Additionally, we develop an interactive WebGL viewer enabling\u0000real-time volumetric video playback on most devices with modern browsers,\u0000including smartphones and tablets. Experimental results show that SwinGS\u0000reduces transmission costs by 83.6% compared to previous work with ignorable\u0000compromise in PSNR. Moreover, SwinGS easily scales to long video sequences\u0000without compromising quality.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image operation chain detection techniques have gained increasing attention recently in the field of multimedia forensics. However, existing detection methods suffer from the generalization problem. Moreover, the channel correlation of color images, which provides additional forensic evidence, is often ignored. To solve these issues, in this article, we propose a novel two-stream multi-channel fusion network for color image operation chain detection, in which the spatial artifact stream and the noise residual stream are explored in a complementary manner. Specifically, we first propose a novel deep residual architecture without pooling in the spatial artifact stream for learning a global feature representation of multi-channel correlation. Then, a set of filters is designed to aggregate the correlation information across channels while capturing low-level features in the noise residual stream. Subsequently, high-level features are extracted by the deep residual model. Finally, features from the two streams are fed into a fusion module to effectively learn richer discriminative representations of the operation chain. Extensive experiments show that the proposed method achieves state-of-the-art generalization ability while maintaining robustness to JPEG compression. The source code used in these experiments will be released at https://github.com/LeiTan-98/TMFNet.
{"title":"TMFNet: Two-Stream Multi-Channels Fusion Networks for Color Image Operation Chain Detection","authors":"Yakun Niu, Lei Tan, Lei Zhang, Xianyu Zuo","doi":"arxiv-2409.07701","DOIUrl":"https://doi.org/arxiv-2409.07701","url":null,"abstract":"Image operation chain detection techniques have gained increasing attention\u0000recently in the field of multimedia forensics. However, existing detection\u0000methods suffer from the generalization problem. Moreover, the channel\u0000correlation of color images that provides additional forensic evidence is often\u0000ignored. To solve these issues, in this article, we propose a novel two-stream\u0000multi-channels fusion networks for color image operation chain detection in\u0000which the spatial artifact stream and the noise residual stream are explored in\u0000a complementary manner. Specifically, we first propose a novel deep residual\u0000architecture without pooling in the spatial artifact stream for learning the\u0000global features representation of multi-channel correlation. Then, a set of\u0000filters is designed to aggregate the correlation information of multi-channels\u0000while capturing the low-level features in the noise residual stream.\u0000Subsequently, the high-level features are extracted by the deep residual model.\u0000Finally, features from the two streams are fed into a fusion module, to\u0000effectively learn richer discriminative representations of the operation chain.\u0000Extensive experiments show that the proposed method achieves state-of-the-art\u0000generalization ability while maintaining robustness to JPEG compression. The\u0000source code used in these experiments will be released at\u0000https://github.com/LeiTan-98/TMFNet.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei
Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with detailed high-resolution textures, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present the High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that recasts single-image-to-multi-view generation as 3D-aware sequential image generation (i.e., orbital video generation). This methodology exploits the temporal consistency knowledge underlying video diffusion models, which generalizes well to geometric consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with a 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learned to further scale up the multi-view images with high-resolution texture details. These high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistent images with highly detailed textures. Source code and data are available at https://github.com/yanghb22-fdu/Hi3D-Official.
{"title":"Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models","authors":"Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei","doi":"arxiv-2409.07452","DOIUrl":"https://doi.org/arxiv-2409.07452","url":null,"abstract":"Despite having tremendous progress in image-to-3D generation, existing\u0000methods still struggle to produce multi-view consistent images with\u0000high-resolution textures in detail, especially in the paradigm of 2D diffusion\u0000that lacks 3D awareness. In this work, we present High-resolution Image-to-3D\u0000model (Hi3D), a new video diffusion based paradigm that redefines a single\u0000image to multi-view images as 3D-aware sequential image generation (i.e.,\u0000orbital video generation). This methodology delves into the underlying temporal\u0000consistency knowledge in video diffusion model that generalizes well to\u0000geometry consistency across multiple views in 3D generation. Technically, Hi3D\u0000first empowers the pre-trained video diffusion model with 3D-aware prior\u0000(camera pose condition), yielding multi-view images with low-resolution texture\u0000details. A 3D-aware video-to-video refiner is learnt to further scale up the\u0000multi-view images with high-resolution texture details. Such high-resolution\u0000multi-view images are further augmented with novel views through 3D Gaussian\u0000Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D\u0000reconstruction. Extensive experiments on both novel view synthesis and single\u0000view reconstruction demonstrate that our Hi3D manages to produce superior\u0000multi-view consistency images with highly-detailed textures. Source code and\u0000data are available at url{https://github.com/yanghb22-fdu/Hi3D-Official}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei
Learning radiance fields (NeRF) with powerful 2D diffusion models has garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D representations of NeRF lack explicit modeling of meshes and textures over surfaces, and such a surface-undefined representation may suffer from issues such as noisy surfaces with ambiguous texture details or cross-view inconsistency. To alleviate this, we present DreamMesh, a novel text-to-3D architecture that pivots on well-defined surfaces (triangle meshes) to generate high-fidelity explicit 3D models. Technically, DreamMesh capitalizes on a distinctive coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by text-guided Jacobians, and then DreamMesh textures the mesh with an interlaced use of 2D diffusion models in a tuning-free manner from multiple viewpoints. In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials. Extensive experiments demonstrate that DreamMesh significantly outperforms state-of-the-art text-to-3D methods in faithfully generating 3D content with richer texture details and enhanced geometry. Our project page is available at https://dreammesh.github.io.
{"title":"DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation","authors":"Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei","doi":"arxiv-2409.07454","DOIUrl":"https://doi.org/arxiv-2409.07454","url":null,"abstract":"Learning radiance fields (NeRF) with powerful 2D diffusion models has\u0000garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D\u0000representations of NeRF lack explicit modeling of meshes and textures over\u0000surfaces, and such surface-undefined way may suffer from the issues, e.g.,\u0000noisy surfaces with ambiguous texture details or cross-view inconsistency. To\u0000alleviate this, we present DreamMesh, a novel text-to-3D architecture that\u0000pivots on well-defined surfaces (triangle meshes) to generate high-fidelity\u0000explicit 3D model. Technically, DreamMesh capitalizes on a distinctive\u0000coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by\u0000text-guided Jacobians and then DreamMesh textures the mesh with an interlaced\u0000use of 2D diffusion models in a tuning free manner from multiple viewpoints. In\u0000the fine stage, DreamMesh jointly manipulates the mesh and refines the texture\u0000map, leading to high-quality triangle meshes with high-fidelity textured\u0000materials. Extensive experiments demonstrate that DreamMesh significantly\u0000outperforms state-of-the-art text-to-3D methods in faithfully generating 3D\u0000content with richer textual details and enhanced geometry. Our project page is\u0000available at https://dreammesh.github.io.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, can significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images is nevertheless not trivial and necessitates delicately enriching details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that first adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to regions with higher frequency content to preserve high-frequency patterns (e.g., edges, corners) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution Magnific AI.
{"title":"FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process","authors":"Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei","doi":"arxiv-2409.07451","DOIUrl":"https://doi.org/arxiv-2409.07451","url":null,"abstract":"The emergence of text-to-image generation models has led to the recognition\u0000that image enhancement, performed as post-processing, would significantly\u0000improve the visual quality of the generated images. Exploring diffusion models\u0000to enhance the generated images nevertheless is not trivial and necessitates to\u0000delicately enrich plentiful details while preserving the visual appearance of\u0000key content in the original image. In this paper, we propose a novel framework,\u0000namely FreeEnhance, for content-consistent image enhancement using the\u0000off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage\u0000process that firstly adds random noise to the input image and then capitalizes\u0000on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to\u0000denoise and enhance the image details. In the noising stage, FreeEnhance is\u0000devised to add lighter noise to the region with higher frequency to preserve\u0000the high-frequent patterns (e.g., edge, corner) in the original image. In the\u0000denoising stage, we present three target properties as constraints to\u0000regularize the predicted noise, enhancing images with high acutance and high\u0000visual quality. Extensive experiments conducted on the HPDv2 dataset\u0000demonstrate that our FreeEnhance outperforms the state-of-the-art image\u0000enhancement models in terms of quantitative metrics and human preference. More\u0000remarkably, FreeEnhance also shows higher human preference compared to the\u0000commercial image enhancement solution of Magnific AI.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon
Estimating the Most Important Person (MIP) in any social event setup is a challenging problem, mainly due to contextual complexity and the scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale `in-the-wild' dataset for identifying human perceptions about the `Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy and a detailed data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset using state-of-the-art MIP localization methods, indicating a significant drop in performance compared to existing datasets. The performance drop shows that existing MIP localization algorithms need to be made more robust to `in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next generation of social situation understanding methods. The code and data are available at https://github.com/surbhimadan92/MIP-GAF.
{"title":"MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding","authors":"Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon","doi":"arxiv-2409.06224","DOIUrl":"https://doi.org/arxiv-2409.06224","url":null,"abstract":"Estimating the Most Important Person (MIP) in any social event setup is a\u0000challenging problem mainly due to contextual complexity and scarcity of labeled\u0000data. Moreover, the causality aspects of MIP estimation are quite subjective\u0000and diverse. To this end, we aim to address the problem by annotating a\u0000large-scale `in-the-wild' dataset for identifying human perceptions about the\u0000`Most Important Person (MIP)' in an image. The paper provides a thorough\u0000description of our proposed Multimodal Large Language Model (MLLM) based data\u0000annotation strategy, and a thorough data quality analysis. Further, we perform\u0000a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art\u0000MIP localization methods, indicating a significant drop in performance compared\u0000to existing datasets. The performance drop shows that the existing MIP\u0000localization algorithms must be more robust with respect to `in-the-wild'\u0000situations. We believe the proposed dataset will play a vital role in building\u0000the next-generation social situation understanding methods. The code and data\u0000is available at https://github.com/surbhimadan92/MIP-GAF.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Very low-resolution face recognition is challenging due to the serious loss of informative facial details caused by resolution degradation. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. First, the generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, the discriminative representation distillation further considers a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into a discriminative head representation, leading to a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showcasing its effectiveness and adaptability.
{"title":"Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition","authors":"Junzheng Zhang, Weijia Guo, Bochao Liu, Ruixin Shi, Yong Li, Shiming Ge","doi":"arxiv-2409.06371","DOIUrl":"https://doi.org/arxiv-2409.06371","url":null,"abstract":"Very low-resolution face recognition is challenging due to the serious loss\u0000of informative facial details in resolution degradation. In this paper, we\u0000propose a generative-discriminative representation distillation approach that\u0000combines generative representation with cross-resolution aligned knowledge\u0000distillation. This approach facilitates very low-resolution face recognition by\u0000jointly distilling generative and discriminative models via two distillation\u0000modules. Firstly, the generative representation distillation takes the encoder\u0000of a diffusion model pretrained for face super-resolution as the generative\u0000teacher to supervise the learning of the student backbone via feature\u0000regression, and then freezes the student backbone. After that, the\u0000discriminative representation distillation further considers a pretrained face\u0000recognizer as the discriminative teacher to supervise the learning of the\u0000student head via cross-resolution relational contrastive distillation. In this\u0000way, the general backbone representation can be transformed into discriminative\u0000head representation, leading to a robust and discriminative student model for\u0000very low-resolution face recognition. Our approach improves the recovery of the\u0000missing details in very low-resolution faces and achieves better knowledge\u0000transfer. Extensive experiments on face datasets demonstrate that our approach\u0000enhances the recognition accuracy of very low-resolution faces, showcasing its\u0000effectiveness and adaptability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanyi He, Peng Yang, Tian Qin, Jiawei Hou, Ning Zhang
In this paper, we explore adaptive offloading and enhancement strategies for video analytics tasks on computing-constrained mobile devices in low-light conditions. We observe that the accuracy of low-light video analytics varies across different enhancement algorithms. The root cause could be the disparities in the effectiveness of enhancement algorithms for feature extraction in analytics models. Specifically, the difference in class activation maps (CAMs) between enhanced and low-light frames demonstrates a positive correlation with video analytics accuracy. Motivated by such observations, a novel enhancement quality assessment method based on CAMs is proposed to evaluate the effectiveness of different enhancement algorithms for low-light videos. Then, we design a multi-edge system, which adaptively offloads and enhances low-light video analytics tasks from mobile devices. To achieve the trade-off between enhancement quality and latency for all system-served mobile devices, we propose a genetic-based scheduling algorithm, which can find a near-optimal solution in a reasonable time to meet the latency requirement. In this way, offloading strategies and enhancement algorithms are properly selected under limited end-edge bandwidth and edge computation resources. Simulation experiments demonstrate the superiority of the proposed system, improving accuracy by up to 20.83% compared to existing benchmarks.
{"title":"Adaptive Offloading and Enhancement for Low-Light Video Analytics on Mobile Devices","authors":"Yuanyi He, Peng Yang, Tian Qin, Jiawei Hou, Ning Zhang","doi":"arxiv-2409.05297","DOIUrl":"https://doi.org/arxiv-2409.05297","url":null,"abstract":"In this paper, we explore adaptive offloading and enhancement strategies for\u0000video analytics tasks on computing-constrained mobile devices in low-light\u0000conditions. We observe that the accuracy of low-light video analytics varies\u0000from different enhancement algorithms. The root cause could be the disparities\u0000in the effectiveness of enhancement algorithms for feature extraction in\u0000analytic models. Specifically, the difference in class activation maps (CAMs)\u0000between enhanced and low-light frames demonstrates a positive correlation with\u0000video analytics accuracy. Motivated by such observations, a novel enhancement\u0000quality assessment method is proposed on CAMs to evaluate the effectiveness of\u0000different enhancement algorithms for low-light videos. Then, we design a\u0000multi-edge system, which adaptively offloads and enhances low-light video\u0000analytics tasks from mobile devices. To achieve the trade-off between the\u0000enhancement quality and the latency for all system-served mobile devices, we\u0000propose a genetic-based scheduling algorithm, which can find a near-optimal\u0000solution in a reasonable time to meet the latency requirement. Thereby, the\u0000offloading strategies and the enhancement algorithms are properly selected\u0000under the condition of limited end-edge bandwidth and edge computation\u0000resources. Simulation experiments demonstrate the superiority of the proposed\u0000system, improving accuracy up to 20.83% compared to existing benchmarks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}