In the domain of human-computer interaction, accurately recognizing and interpreting human emotions is crucial yet challenging due to the complexity and subtlety of emotional expressions. This study explores the potential for detecting a rich and flexible range of emotions through a multimodal approach that integrates facial expressions, voice tones, and transcripts from video clips. We propose a novel framework that maps a variety of emotions into a three-dimensional Valence-Arousal-Dominance (VAD) space, which reflects the fluctuations and positivity/negativity of emotions and enables a more varied and comprehensive representation of emotional states. We employed K-means clustering to transition emotions from traditional discrete categorization to a continuous labeling system and built an emotion recognition classifier on top of this system. The effectiveness of the proposed model is evaluated on the MER2024 dataset, which contains culturally consistent video clips from Chinese movies and TV series, annotated with both discrete and open-vocabulary emotion labels. Our experiments successfully achieved the transformation between discrete and continuous models, and the proposed model generated a more diverse and comprehensive emotion vocabulary while maintaining strong accuracy.
{"title":"Bridging Discrete and Continuous: A Multimodal Strategy for Complex Emotion Detection","authors":"Jiehui Jia, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07901","DOIUrl":"https://doi.org/arxiv-2409.07901","url":null,"abstract":"In the domain of human-computer interaction, accurately recognizing and\u0000interpreting human emotions is crucial yet challenging due to the complexity\u0000and subtlety of emotional expressions. This study explores the potential for\u0000detecting a rich and flexible range of emotions through a multimodal approach\u0000which integrates facial expressions, voice tones, and transcript from video\u0000clips. We propose a novel framework that maps variety of emotions in a\u0000three-dimensional Valence-Arousal-Dominance (VAD) space, which could reflect\u0000the fluctuations and positivity/negativity of emotions to enable a more variety\u0000and comprehensive representation of emotional states. We employed K-means\u0000clustering to transit emotions from traditional discrete categorization to a\u0000continuous labeling system and built a classifier for emotion recognition upon\u0000this system. The effectiveness of the proposed model is evaluated using the\u0000MER2024 dataset, which contains culturally consistent video clips from Chinese\u0000movies and TV series, annotated with both discrete and open-vocabulary emotion\u0000labels. Our experiment successfully achieved the transformation between\u0000discrete and continuous models, and the proposed model generated a more diverse\u0000and comprehensive set of emotion vocabulary while maintaining strong accuracy.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study addresses the challenge of accurately segmenting 3D Gaussian Splatting from 2D masks. Conventional methods often rely on iterative gradient descent to assign each Gaussian a unique label, leading to lengthy optimization and sub-optimal solutions. Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. The core insight of our method is that, with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially a linear function with respect to the labels of each Gaussian. As such, the optimal label assignment can be solved via linear programming in closed form. This solution capitalizes on the alpha blending characteristic of the splatting process for single-step optimization. By incorporating a background bias in our objective function, our method shows superior robustness in 3D segmentation against noise. Remarkably, our optimization completes within 30 seconds, about 50× faster than the best existing methods. Extensive experiments demonstrate the efficiency and robustness of our method in segmenting various scenes, and its superior performance in downstream tasks such as object removal and inpainting. Demos and code will be available at https://github.com/florinshen/FlashSplat.
{"title":"FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally","authors":"Qiuhong Shen, Xingyi Yang, Xinchao Wang","doi":"arxiv-2409.08270","DOIUrl":"https://doi.org/arxiv-2409.08270","url":null,"abstract":"This study addresses the challenge of accurately segmenting 3D Gaussian\u0000Splatting from 2D masks. Conventional methods often rely on iterative gradient\u0000descent to assign each Gaussian a unique label, leading to lengthy optimization\u0000and sub-optimal solutions. Instead, we propose a straightforward yet globally\u0000optimal solver for 3D-GS segmentation. The core insight of our method is that,\u0000with a reconstructed 3D-GS scene, the rendering of the 2D masks is essentially\u0000a linear function with respect to the labels of each Gaussian. As such, the\u0000optimal label assignment can be solved via linear programming in closed form.\u0000This solution capitalizes on the alpha blending characteristic of the splatting\u0000process for single step optimization. By incorporating the background bias in\u0000our objective function, our method shows superior robustness in 3D segmentation\u0000against noises. Remarkably, our optimization completes within 30 seconds, about\u000050$times$ faster than the best existing methods. Extensive experiments\u0000demonstrate the efficiency and robustness of our method in segmenting various\u0000scenes, and its superior performance in downstream tasks such as object removal\u0000and inpainting. Demos and code will be available at\u0000https://github.com/florinshen/FlashSplat.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant attention in computer vision and computer graphics due to its high rendering speed and remarkable quality. While extant research has endeavored to extend the application of 3DGS from static to dynamic scenes, such efforts have been consistently impeded by excessive model sizes, constraints on video duration, and content deviation. These limitations significantly compromise the streamability of dynamic 3D Gaussian models, thereby restricting their utility in downstream applications, including volumetric video, autonomous vehicles, and immersive technologies such as virtual, augmented, and mixed reality. This paper introduces SwinGS, a novel framework for training, delivering, and rendering volumetric video in a real-time streaming fashion. To address the aforementioned challenges and enhance streamability, SwinGS integrates spacetime Gaussians with Markov Chain Monte Carlo (MCMC) to adapt the model to various 3D scenes across frames, while a sliding window captures Gaussian snapshots for each frame in an accumulative manner. We implement a prototype of SwinGS and demonstrate its streamability across various datasets and scenes. Additionally, we develop an interactive WebGL viewer enabling real-time volumetric video playback on most devices with modern browsers, including smartphones and tablets. Experimental results show that SwinGS reduces transmission costs by 83.6% compared to previous work with a negligible compromise in PSNR. Moreover, SwinGS easily scales to long video sequences without compromising quality.
{"title":"SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming with Arbitrary Length","authors":"Bangya Liu, Suman Banerjee","doi":"arxiv-2409.07759","DOIUrl":"https://doi.org/arxiv-2409.07759","url":null,"abstract":"Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant\u0000attention in computer vision and computer graphics due to its high rendering\u0000speed and remarkable quality. While extant research has endeavored to extend\u0000the application of 3DGS from static to dynamic scenes, such efforts have been\u0000consistently impeded by excessive model sizes, constraints on video duration,\u0000and content deviation. These limitations significantly compromise the\u0000streamability of dynamic 3D Gaussian models, thereby restricting their utility\u0000in downstream applications, including volumetric video, autonomous vehicle, and\u0000immersive technologies such as virtual, augmented, and mixed reality. This paper introduces SwinGS, a novel framework for training, delivering, and\u0000rendering volumetric video in a real-time streaming fashion. To address the\u0000aforementioned challenges and enhance streamability, SwinGS integrates\u0000spacetime Gaussian with Markov Chain Monte Carlo (MCMC) to adapt the model to\u0000fit various 3D scenes across frames, in the meantime employing a sliding window\u0000captures Gaussian snapshots for each frame in an accumulative way. We implement\u0000a prototype of SwinGS and demonstrate its streamability across various datasets\u0000and scenes. Additionally, we develop an interactive WebGL viewer enabling\u0000real-time volumetric video playback on most devices with modern browsers,\u0000including smartphones and tablets. Experimental results show that SwinGS\u0000reduces transmission costs by 83.6% compared to previous work with ignorable\u0000compromise in PSNR. Moreover, SwinGS easily scales to long video sequences\u0000without compromising quality.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image operation chain detection techniques have gained increasing attention recently in the field of multimedia forensics. However, existing detection methods suffer from the generalization problem. Moreover, the channel correlation of color images, which provides additional forensic evidence, is often ignored. To solve these issues, in this article, we propose a novel two-stream multi-channel fusion network for color image operation chain detection, in which the spatial artifact stream and the noise residual stream are explored in a complementary manner. Specifically, we first propose a novel deep residual architecture without pooling in the spatial artifact stream for learning a global feature representation of multi-channel correlation. Then, a set of filters is designed to aggregate the correlation information across channels while capturing low-level features in the noise residual stream. Subsequently, high-level features are extracted by the deep residual model. Finally, features from the two streams are fed into a fusion module to effectively learn richer discriminative representations of the operation chain. Extensive experiments show that the proposed method achieves state-of-the-art generalization ability while maintaining robustness to JPEG compression. The source code used in these experiments will be released at https://github.com/LeiTan-98/TMFNet.
{"title":"TMFNet: Two-Stream Multi-Channels Fusion Networks for Color Image Operation Chain Detection","authors":"Yakun Niu, Lei Tan, Lei Zhang, Xianyu Zuo","doi":"arxiv-2409.07701","DOIUrl":"https://doi.org/arxiv-2409.07701","url":null,"abstract":"Image operation chain detection techniques have gained increasing attention\u0000recently in the field of multimedia forensics. However, existing detection\u0000methods suffer from the generalization problem. Moreover, the channel\u0000correlation of color images that provides additional forensic evidence is often\u0000ignored. To solve these issues, in this article, we propose a novel two-stream\u0000multi-channels fusion networks for color image operation chain detection in\u0000which the spatial artifact stream and the noise residual stream are explored in\u0000a complementary manner. Specifically, we first propose a novel deep residual\u0000architecture without pooling in the spatial artifact stream for learning the\u0000global features representation of multi-channel correlation. Then, a set of\u0000filters is designed to aggregate the correlation information of multi-channels\u0000while capturing the low-level features in the noise residual stream.\u0000Subsequently, the high-level features are extracted by the deep residual model.\u0000Finally, features from the two streams are fed into a fusion module, to\u0000effectively learn richer discriminative representations of the operation chain.\u0000Extensive experiments show that the proposed method achieves state-of-the-art\u0000generalization ability while maintaining robustness to JPEG compression. The\u0000source code used in these experiments will be released at\u0000https://github.com/LeiTan-98/TMFNet.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei
Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with detailed high-resolution textures, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present the High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that recasts single-image-to-multi-view generation as 3D-aware sequential image generation (i.e., orbital video generation). This methodology exploits the temporal consistency knowledge underlying video diffusion models, which generalizes well to geometric consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with a 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learned to further scale up the multi-view images with high-resolution texture details. These high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistent images with highly detailed textures. Source code and data are available at https://github.com/yanghb22-fdu/Hi3D-Official.
{"title":"Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models","authors":"Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei","doi":"arxiv-2409.07452","DOIUrl":"https://doi.org/arxiv-2409.07452","url":null,"abstract":"Despite having tremendous progress in image-to-3D generation, existing\u0000methods still struggle to produce multi-view consistent images with\u0000high-resolution textures in detail, especially in the paradigm of 2D diffusion\u0000that lacks 3D awareness. In this work, we present High-resolution Image-to-3D\u0000model (Hi3D), a new video diffusion based paradigm that redefines a single\u0000image to multi-view images as 3D-aware sequential image generation (i.e.,\u0000orbital video generation). This methodology delves into the underlying temporal\u0000consistency knowledge in video diffusion model that generalizes well to\u0000geometry consistency across multiple views in 3D generation. Technically, Hi3D\u0000first empowers the pre-trained video diffusion model with 3D-aware prior\u0000(camera pose condition), yielding multi-view images with low-resolution texture\u0000details. A 3D-aware video-to-video refiner is learnt to further scale up the\u0000multi-view images with high-resolution texture details. Such high-resolution\u0000multi-view images are further augmented with novel views through 3D Gaussian\u0000Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D\u0000reconstruction. Extensive experiments on both novel view synthesis and single\u0000view reconstruction demonstrate that our Hi3D manages to produce superior\u0000multi-view consistency images with highly-detailed textures. Source code and\u0000data are available at url{https://github.com/yanghb22-fdu/Hi3D-Official}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei
Learning radiance fields (NeRF) with powerful 2D diffusion models has garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D representations of NeRF lack explicit modeling of meshes and textures over surfaces, and such a surface-undefined representation may suffer from issues such as noisy surfaces with ambiguous texture details or cross-view inconsistency. To alleviate this, we present DreamMesh, a novel text-to-3D architecture that pivots on well-defined surfaces (triangle meshes) to generate high-fidelity explicit 3D models. Technically, DreamMesh capitalizes on a distinctive coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by text-guided Jacobians, and then DreamMesh textures the mesh with an interlaced use of 2D diffusion models in a tuning-free manner from multiple viewpoints. In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials. Extensive experiments demonstrate that DreamMesh significantly outperforms state-of-the-art text-to-3D methods in faithfully generating 3D content with richer texture details and enhanced geometry. Our project page is available at https://dreammesh.github.io.
{"title":"DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation","authors":"Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei","doi":"arxiv-2409.07454","DOIUrl":"https://doi.org/arxiv-2409.07454","url":null,"abstract":"Learning radiance fields (NeRF) with powerful 2D diffusion models has\u0000garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D\u0000representations of NeRF lack explicit modeling of meshes and textures over\u0000surfaces, and such surface-undefined way may suffer from the issues, e.g.,\u0000noisy surfaces with ambiguous texture details or cross-view inconsistency. To\u0000alleviate this, we present DreamMesh, a novel text-to-3D architecture that\u0000pivots on well-defined surfaces (triangle meshes) to generate high-fidelity\u0000explicit 3D model. Technically, DreamMesh capitalizes on a distinctive\u0000coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by\u0000text-guided Jacobians and then DreamMesh textures the mesh with an interlaced\u0000use of 2D diffusion models in a tuning free manner from multiple viewpoints. In\u0000the fine stage, DreamMesh jointly manipulates the mesh and refines the texture\u0000map, leading to high-quality triangle meshes with high-fidelity textured\u0000materials. Extensive experiments demonstrate that DreamMesh significantly\u0000outperforms state-of-the-art text-to-3D methods in faithfully generating 3D\u0000content with richer textual details and enhanced geometry. Our project page is\u0000available at https://dreammesh.github.io.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, can significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images is nevertheless not trivial and necessitates delicately enriching details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that first adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to regions with higher frequency content to preserve high-frequency patterns (e.g., edges, corners) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution Magnific AI.
{"title":"FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process","authors":"Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei","doi":"arxiv-2409.07451","DOIUrl":"https://doi.org/arxiv-2409.07451","url":null,"abstract":"The emergence of text-to-image generation models has led to the recognition\u0000that image enhancement, performed as post-processing, would significantly\u0000improve the visual quality of the generated images. Exploring diffusion models\u0000to enhance the generated images nevertheless is not trivial and necessitates to\u0000delicately enrich plentiful details while preserving the visual appearance of\u0000key content in the original image. In this paper, we propose a novel framework,\u0000namely FreeEnhance, for content-consistent image enhancement using the\u0000off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage\u0000process that firstly adds random noise to the input image and then capitalizes\u0000on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to\u0000denoise and enhance the image details. In the noising stage, FreeEnhance is\u0000devised to add lighter noise to the region with higher frequency to preserve\u0000the high-frequent patterns (e.g., edge, corner) in the original image. In the\u0000denoising stage, we present three target properties as constraints to\u0000regularize the predicted noise, enhancing images with high acutance and high\u0000visual quality. Extensive experiments conducted on the HPDv2 dataset\u0000demonstrate that our FreeEnhance outperforms the state-of-the-art image\u0000enhancement models in terms of quantitative metrics and human preference. More\u0000remarkably, FreeEnhance also shows higher human preference compared to the\u0000commercial image enhancement solution of Magnific AI.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon
Estimating the Most Important Person (MIP) in any social event setup is a challenging problem, mainly due to contextual complexity and the scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale `in-the-wild' dataset for identifying human perceptions about the `Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy and a detailed data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset using state-of-the-art MIP localization methods, indicating a significant drop in performance compared to existing datasets. The performance drop shows that existing MIP localization algorithms need to be made more robust to `in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next generation of social situation understanding methods. The code and data are available at https://github.com/surbhimadan92/MIP-GAF.
{"title":"MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding","authors":"Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon","doi":"arxiv-2409.06224","DOIUrl":"https://doi.org/arxiv-2409.06224","url":null,"abstract":"Estimating the Most Important Person (MIP) in any social event setup is a\u0000challenging problem mainly due to contextual complexity and scarcity of labeled\u0000data. Moreover, the causality aspects of MIP estimation are quite subjective\u0000and diverse. To this end, we aim to address the problem by annotating a\u0000large-scale `in-the-wild' dataset for identifying human perceptions about the\u0000`Most Important Person (MIP)' in an image. The paper provides a thorough\u0000description of our proposed Multimodal Large Language Model (MLLM) based data\u0000annotation strategy, and a thorough data quality analysis. Further, we perform\u0000a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art\u0000MIP localization methods, indicating a significant drop in performance compared\u0000to existing datasets. The performance drop shows that the existing MIP\u0000localization algorithms must be more robust with respect to `in-the-wild'\u0000situations. We believe the proposed dataset will play a vital role in building\u0000the next-generation social situation understanding methods. The code and data\u0000is available at https://github.com/surbhimadan92/MIP-GAF.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Very low-resolution face recognition is challenging due to the serious loss of informative facial details caused by resolution degradation. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. First, the generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, the discriminative representation distillation further considers a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into a discriminative head representation, leading to a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showcasing its effectiveness and adaptability.
{"title":"Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition","authors":"Junzheng Zhang, Weijia Guo, Bochao Liu, Ruixin Shi, Yong Li, Shiming Ge","doi":"arxiv-2409.06371","DOIUrl":"https://doi.org/arxiv-2409.06371","url":null,"abstract":"Very low-resolution face recognition is challenging due to the serious loss\u0000of informative facial details in resolution degradation. In this paper, we\u0000propose a generative-discriminative representation distillation approach that\u0000combines generative representation with cross-resolution aligned knowledge\u0000distillation. This approach facilitates very low-resolution face recognition by\u0000jointly distilling generative and discriminative models via two distillation\u0000modules. Firstly, the generative representation distillation takes the encoder\u0000of a diffusion model pretrained for face super-resolution as the generative\u0000teacher to supervise the learning of the student backbone via feature\u0000regression, and then freezes the student backbone. After that, the\u0000discriminative representation distillation further considers a pretrained face\u0000recognizer as the discriminative teacher to supervise the learning of the\u0000student head via cross-resolution relational contrastive distillation. In this\u0000way, the general backbone representation can be transformed into discriminative\u0000head representation, leading to a robust and discriminative student model for\u0000very low-resolution face recognition. Our approach improves the recovery of the\u0000missing details in very low-resolution faces and achieves better knowledge\u0000transfer. Extensive experiments on face datasets demonstrate that our approach\u0000enhances the recognition accuracy of very low-resolution faces, showcasing its\u0000effectiveness and adaptability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanyi He, Peng Yang, Tian Qin, Jiawei Hou, Ning Zhang
In this paper, we explore adaptive offloading and enhancement strategies for video analytics tasks on computing-constrained mobile devices in low-light conditions. We observe that the accuracy of low-light video analytics varies across different enhancement algorithms. The root cause could be the disparities in the effectiveness of enhancement algorithms for feature extraction in analytics models. Specifically, the difference in class activation maps (CAMs) between enhanced and low-light frames demonstrates a positive correlation with video analytics accuracy. Motivated by such observations, a novel enhancement quality assessment method based on CAMs is proposed to evaluate the effectiveness of different enhancement algorithms for low-light videos. Then, we design a multi-edge system, which adaptively offloads and enhances low-light video analytics tasks from mobile devices. To achieve the trade-off between enhancement quality and latency for all system-served mobile devices, we propose a genetic-based scheduling algorithm, which can find a near-optimal solution in a reasonable time to meet the latency requirement. In this way, offloading strategies and enhancement algorithms are properly selected under limited end-edge bandwidth and edge computation resources. Simulation experiments demonstrate the superiority of the proposed system, improving accuracy by up to 20.83% compared to existing benchmarks.
{"title":"Adaptive Offloading and Enhancement for Low-Light Video Analytics on Mobile Devices","authors":"Yuanyi He, Peng Yang, Tian Qin, Jiawei Hou, Ning Zhang","doi":"arxiv-2409.05297","DOIUrl":"https://doi.org/arxiv-2409.05297","url":null,"abstract":"In this paper, we explore adaptive offloading and enhancement strategies for\u0000video analytics tasks on computing-constrained mobile devices in low-light\u0000conditions. We observe that the accuracy of low-light video analytics varies\u0000from different enhancement algorithms. The root cause could be the disparities\u0000in the effectiveness of enhancement algorithms for feature extraction in\u0000analytic models. Specifically, the difference in class activation maps (CAMs)\u0000between enhanced and low-light frames demonstrates a positive correlation with\u0000video analytics accuracy. Motivated by such observations, a novel enhancement\u0000quality assessment method is proposed on CAMs to evaluate the effectiveness of\u0000different enhancement algorithms for low-light videos. Then, we design a\u0000multi-edge system, which adaptively offloads and enhances low-light video\u0000analytics tasks from mobile devices. To achieve the trade-off between the\u0000enhancement quality and the latency for all system-served mobile devices, we\u0000propose a genetic-based scheduling algorithm, which can find a near-optimal\u0000solution in a reasonable time to meet the latency requirement. Thereby, the\u0000offloading strategies and the enhancement algorithms are properly selected\u0000under the condition of limited end-edge bandwidth and edge computation\u0000resources. Simulation experiments demonstrate the superiority of the proposed\u0000system, improving accuracy up to 20.83% compared to existing benchmarks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}