Jiayi Xu, Zhengyang Wu, Chenming Zhang, Xiaogang Jin, Yaohua Ji
Fast and highly realistic multi-view hair transfer plays a crucial role in evaluating the effectiveness of virtual hair try-on systems. However, GAN-based generation and editing methods face persistent challenges in feature disentanglement. Achieving pixel-level, attribute-specific modifications—such as changing hairstyle or hair color without affecting other facial features—remains a long-standing problem. To address this limitation, we propose a novel multi-view hair transfer framework that leverages a hair-only intermediate facial representation and a 3D-guided masking mechanism. Our approach disentangles tri-plane facial features into spatial geometric components and global style descriptors, enabling independent and precise control over hairstyle and hair color. By introducing a dedicated intermediate representation focused solely on hair and incorporating a two-stage feature fusion strategy guided by the generated 3D mask, our framework achieves fine-grained local editing across multiple viewpoints while preserving facial integrity and improving background consistency. Extensive experiments demonstrate that our method produces visually compelling and natural results in side-to-front view hair transfer tasks, offering a robust and flexible solution for high-fidelity hair reconstruction and manipulation.
Feature Disentanglement in GANs for Photorealistic Multi-view Hair Transfer. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70245.
Per-garment virtual try-on methods collect garment-specific datasets and train networks tailored to each garment to achieve superior results. However, these approaches often struggle with loose-fitting garments due to two key limitations: (1) They rely on human body semantic maps to align garments with the body, but these maps become unreliable when body contours are obscured by loose-fitting garments, resulting in degraded outcomes; (2) They train garment synthesis networks on a per-frame basis without utilizing temporal information, leading to noticeable jittering artifacts. To address the first limitation, we propose a two-stage approach for robust semantic map estimation. First, we extract a garment-invariant representation from the raw input image. This representation is then passed through an auxiliary network to estimate the semantic map. This enhances the robustness of semantic map estimation under loose-fitting garments during garment-specific dataset generation. To address the second limitation, we introduce a recurrent garment synthesis framework that incorporates temporal dependencies to improve frame-to-frame coherence while maintaining real-time performance. We conducted qualitative and quantitative evaluations to demonstrate that our method outperforms existing approaches in both image quality and temporal coherence. Ablation studies further validate the effectiveness of the garment-invariant representation and the recurrent synthesis framework.
Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments. Zaiqiang Wu, I-Chao Shen, Takeo Igarashi. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70272. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/cgf.70272
The Hausdorff distance is a fundamental metric with widespread applications across many fields. However, computing it exactly remains expensive, especially for large-scale datasets. This work targets the exact point-to-point Hausdorff distance between point sets. We present RT-HDIST, the first Hausdorff distance algorithm accelerated by ray-tracing cores (RT-cores). By reformulating the Hausdorff distance problem as a series of nearest-neighbor searches and introducing a novel quantized voxel-index space, RT-HDIST achieves significant reductions in computational overhead while maintaining exact results. Extensive benchmarks demonstrate up to a two-order-of-magnitude speedup over prior state-of-the-art methods, underscoring RT-HDIST's potential for real-time and large-scale applications.
RT-HDIST: Ray-Tracing Core-based Hausdorff Distance Computation. Young Woo Kim, Jaehong Lee, Duksu Kim. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70229.
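The nearest-neighbor reformulation behind RT-HDIST starts from the standard definition: the directed distance h(A, B) takes each point of A to its nearest neighbor in B and keeps the maximum, and the symmetric Hausdorff distance is the larger of the two directed distances. A minimal brute-force NumPy sketch of that definition (only the metric itself, not the RT-core-accelerated algorithm):

```python
import numpy as np

def directed_hausdorff(a, b):
    # h(A, B): for each point in A, find its nearest neighbor in B,
    # then take the worst (largest) of those nearest-neighbor distances.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).max()

def hausdorff(a, b):
    # Symmetric Hausdorff distance: max of the two directed distances.
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [3.0, 0.0]])
print(hausdorff(A, B))  # 2.0: point (3,0) is 2 away from its nearest neighbor (1,0)
```

The quadratic pairwise-distance matrix is exactly the cost that RT-HDIST's nearest-neighbor searches over a quantized voxel-index space are designed to avoid.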
Yifan Zhao, Liangchen Li, Yuqi Zhou, Kai Wang, Yan Liang, Juyong Zhang
Macro lenses offer high resolution and large magnification, so 3D modeling of small, detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that severely hinders clear imaging of captured objects and their high-quality 3D reconstruction. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts differentiable rendering to self-supervise the optimization of the 3D model and the defocus blur kernels. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.
Joint Deblurring and 3D Reconstruction for Macrophotography. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70253.
Displacement mapping is an important tool for modeling detailed geometric features. We explore the problem of authoring complex surfaces while ray tracing interactively. Current techniques for ray tracing displaced surfaces rely on acceleration structures that require dynamic rebuilding when edited. These techniques are typically used for massive static scenes or the compression of detailed source assets. Our interest lies in modeling and look development of artistic features with real-time ray tracing. We introduce projective displacement mapping as a direct sampling method combined with a hardware BVH. Quality and performance are improved over existing methods with smoothed displaced normals, thin feature sampling, tight prism bounds and ray bi-linear patch intersections.
Projective Displacement Mapping for Ray Traced Editable Surfaces. Rama Hoetzlein. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70235. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/cgf.70235
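The direct-sampling idea that distinguishes this approach from BVH-of-microtriangles methods can be loosely illustrated with a simplified 2D heightfield march. This is only a toy sketch of sampling a displacement function during ray traversal, not the paper's prism-bounded projective method; the `height` function is a hypothetical displacement over a flat base line:

```python
import math

def height(x):
    # Hypothetical displacement of the base line y = 0.
    return 0.25 * math.sin(4.0 * x)

def march(origin, direction, t_max=10.0, dt=0.01):
    # Step along the ray and report the first parameter t at which the ray
    # dips below the displaced surface. The displacement is sampled directly
    # at traversal time instead of being pre-tessellated and put in a BVH.
    t = 0.0
    while t < t_max:
        px = origin[0] + t * direction[0]
        py = origin[1] + t * direction[1]
        if py <= height(px):
            return t  # hit parameter along the ray
        t += dt
    return None  # ray left the domain without hitting the surface

# A downward-slanted ray starting above the surface must eventually hit it.
hit = march((0.0, 1.0), (0.7071, -0.7071))
print(hit is not None)  # True
```

Because nothing is precomputed from the displacement, editing `height` takes effect on the very next ray — the property that makes direct sampling attractive for interactive authoring.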
In this paper, we propose an efficient single-stage hybrid architecture for image completion. Existing transformer-based image completion methods often struggle with accurate content restoration, largely due to their ineffective modeling of corrupted channel information and the attention noise introduced by softmax-based mechanisms, which results in blurry textures and distorted structures. Additionally, these methods frequently fail to maintain texture consistency, either relying on imprecise mask sampling or incurring substantial computational costs from complex similarity calculations. To address these limitations, we present two key contributions: a Hybrid Sparse Self-Attention (HSA) module and a Feature Alignment Module (FAM). The HSA module enhances structural recovery by decoupling spatial and channel attention with sparse activation, while the FAM enforces texture consistency by aligning encoder and decoder features via a mask-free, energy-gated mechanism without additional inference cost. Our method achieves state-of-the-art image completion results with the fastest inference speed among single-stage networks, as measured by PSNR, SSIM, FID, and LPIPS on CelebA-HQ, Places2, and Paris datasets.
Hybrid Sparse Transformer and Feature Alignment for Efficient Image Completion. L. Chen, H. Sun. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70255.
Generating 3D objects with complex topologies from monocular images remains a challenge in computer graphics, due to the difficulty of modeling varying 3D shapes with disentangled, steerable geometry and visual attributes. NeRF-based methods suffer from slow volumetric rendering and limited structural controllability. Recent advances in 3D Gaussian Splatting provide a more efficient alternative, but generative modeling with separate control over structure and appearance remains underexplored. In this paper, we propose G-SplatGAN, a novel 3D-aware generation framework that combines the rendering efficiency of 3D Gaussian Splatting with disentangled latent modeling. Starting from a shared Gaussian template, our method uses dual modulation branches to modulate geometry and appearance from independent latent codes, enabling precise shape manipulation and controllable generation. We adopt a progressive adversarial training scheme with multi-scale and patch-based discriminators to capture both global structure and local detail. Our model requires no 3D supervision and is trained on monocular images with known camera poses, reducing data reliance while supporting real-image inversion through a geometry-aware encoder. Experiments show that G-SplatGAN achieves superior performance in rendering speed, controllability, and image fidelity, offering a compelling solution for controllable 3D generation using Gaussian representations.
G-SplatGAN: Disentangled 3D Gaussian Generation for Complex Shapes via Multi-Scale Patch Discriminators. Jiaqi Li, Haochuan Dang, Zhi Zhou, Junke Zhu, Zhangjin Huang. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70256.
Michael Stroh, Patrick Paetzold, Daniel Berio, Rebecca Kehlbeck, Frederic Fol Leymarie, Oliver Deussen, Noura Faraj
We present an adaptive, semantics-based abstraction approach that balances aesthetic quality and structural coherence within the practical constraints of robotic painting. We apply panoptic segmentation with color-based over-segmentation to partition images into meaningful regions aligned with semantic objects, while providing flexible abstraction levels. Automatic parameter selection for region merging is enabled by semantic saliency maps, derived from Out-of-Distribution segmentation techniques in combination with machine learning methods for feature detection. This preserves the boundaries of salient objects while simplifying less prominent regions. A graph-based community detection step further refines the abstraction by grouping regions according to local connectivity and semantic coherence. The runtime of our method outperforms optimization-based image vectorization methods, enabling the efficient generation of multiple abstraction levels that can serve as hierarchical layers for robotic painting. We demonstrate the quality of our method by showing abstraction results, robotic paintings with the e-David robot, and a comparison to other abstraction methods.
Using Saliency for Semantic Image Abstractions in Robotic Painting. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70259. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/cgf.70259
Constructing and sharing 3D maps is essential for many applications, including autonomous driving and augmented reality. Recently, 3D Gaussian splatting has emerged as a promising approach for accurate 3D reconstruction. However, a practical map-sharing system that features high-fidelity, continuous updates, and network efficiency remains elusive. To address these challenges, we introduce GS-Share, a photorealistic map-sharing system with a compact representation. The core of GS-Share includes anchor-based global map construction, virtual-image-based map enhancement, and incremental map update. We evaluate GS-Share against state-of-the-art methods, demonstrating that our system achieves higher fidelity, particularly for extrapolated views, with improvements of 11%, 22%, and 74% in PSNR, LPIPS, and Depth L1, respectively. Furthermore, GS-Share is significantly more compact, reducing map transmission overhead by 36%.
GS-Share: Enabling High-fidelity Map Sharing with Incremental Gaussian Splatting. Xinran Zhang, Hanqi Zhu, Yifan Duan, Yanyong Zhang. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70248.
While pre-trained 3D vision-language models are becoming increasingly available, there remains a lack of frameworks that can effectively harness their capabilities for few-shot classification. In this work, we propose PointGMDA, a training-free framework that combines Gaussian Mixture Models (GMMs) with Gaussian Discriminant Analysis (GDA) to perform robust classification using only a few labeled point cloud samples. Our method estimates GMM parameters per class from support data and computes mixture-weighted prototypes, which are then used in GDA with a shared covariance matrix to construct decision boundaries. This formulation allows us to model intra-class variability more expressively than traditional single-prototype approaches, while maintaining analytical tractability. To incorporate semantic priors, we integrate CLIP-style textual prompts and fuse predictions from geometric and textual modalities through a hybrid scoring strategy. We further introduce PointGMDA-T, a lightweight attention-guided refinement module that learns residuals for fast feature adaptation, improving robustness under distribution shift. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate that PointGMDA outperforms strong baselines across a variety of few-shot settings, with consistent gains under both training-free and fine-tuned conditions. These results highlight the effectiveness and generality of our probabilistic modeling and multimodal adaptation framework. Our code is publicly available at https://github.com/djzgroup/PointGMDA.
Multimodal 3D Few-Shot Classification via Gaussian Mixture Discriminant Analysis. Yiqi Wu, Huachao Wu, Ronglei Hu, Yilin Chen, Dejun Zhang. Computer Graphics Forum 44(7), 2025. DOI: 10.1111/cgf.70268.
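The classifier the PointGMDA abstract describes can be written down compactly: per-class GMM component means collapse into a mixture-weighted prototype, and a shared-covariance Gaussian discriminant scores a query feature linearly in those prototypes. A toy NumPy sketch under stated assumptions (feature values, component means, and weights below are made up for illustration; the CLIP text branch and the PointGMDA-T refinement module are omitted):

```python
import numpy as np

def mixture_prototype(component_means, weights):
    # Collapse a class's GMM component means into one mixture-weighted prototype.
    return np.average(component_means, axis=0, weights=weights)

def gda_scores(x, prototypes, shared_cov, priors):
    # Linear discriminant score per class under a shared covariance S:
    #   delta_k(x) = x^T S^-1 mu_k - 0.5 * mu_k^T S^-1 mu_k + log pi_k
    S_inv = np.linalg.inv(shared_cov)
    return np.array([
        x @ S_inv @ mu - 0.5 * mu @ S_inv @ mu + np.log(pi)
        for mu, pi in zip(prototypes, priors)
    ])

# Toy 2-class setup with two GMM components per class (hypothetical values).
protos = [
    mixture_prototype(np.array([[0.0, 0.0], [0.2, 0.0]]), [0.5, 0.5]),
    mixture_prototype(np.array([[2.0, 2.0], [1.8, 2.0]]), [0.5, 0.5]),
]
scores = gda_scores(np.array([1.9, 1.9]), protos, np.eye(2), [0.5, 0.5])
print(int(np.argmax(scores)))  # 1: the query lies near the second class prototype
```

Sharing one covariance matrix across classes is what keeps the decision boundaries linear and the few-shot estimate stable; per-class covariances would need far more support samples to estimate reliably.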