The identification and recognition of urban features are essential for creating accurate and comprehensive digital representations of cities. In particular, the automatic characterization of façade elements plays a key role in enabling semantic enrichment and 3D reconstruction. It also supports urban analysis and underpins various applications, including planning, simulation, and visualization. This work presents a pipeline for the automatic recognition of façades within complex urban scenes represented as point clouds. The method employs an enhanced partitioning strategy that extends beyond strict building footprints by incorporating surrounding buffer zones, allowing for a more complete capture of façade geometry, particularly in dense urban contexts. This is combined with a primitive recognition stage based on the Hough transform, enabling the detection of both planar and curved façade structures. The proposed partitioning overcomes the limitations of traditional footprint-based segmentation, which often disregards contextual geometry and leads to misclassifications at building boundaries. Integrated with the primitive recognition step, the resulting pipeline is robust to noise and incomplete data, and supports geometry-aware façade recognition, contributing to scalable analysis of large-scale urban environments.
{"title":"PBF-FR: Partitioning beyond footprints for façade recognition in urban point clouds","authors":"Daniela Cabiddu , Chiara Romanengo , Michela Mortara","doi":"10.1016/j.cag.2025.104399","DOIUrl":"10.1016/j.cag.2025.104399","url":null,"abstract":"<div><div>The identification and recognition of urban features are essential for creating accurate and comprehensive digital representations of cities. In particular, the automatic characterization of façade elements plays a key role in enabling semantic enrichment and 3D reconstruction. It also supports urban analysis and underpins various applications, including planning, simulation, and visualization. This work presents a pipeline for the automatic recognition of façades within complex urban scenes represented as point clouds. The method employs an enhanced partitioning strategy that extends beyond strict building footprints by incorporating surrounding buffer zones, allowing for a more complete capture of façade geometry, particularly in dense urban contexts. This is combined with a primitive recognition stage based on the Hough transform, enabling the detection of both planar and curved façade structures. The proposed partitioning overcomes the limitations of traditional footprint-based segmentation, which often disregards contextual geometry and leads to misclassifications at building boundaries. Integrated with the primitive recognition step, the resulting pipeline is robust to noise and incomplete data, and supports geometry-aware façade recognition, contributing to scalable analysis of large-scale urban environments.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104399"},"PeriodicalIF":2.8,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-15 | DOI: 10.1016/j.cag.2025.104353
Shaorong Sun , Shuchao Pang , Yazhou Yao , Xiaoshui Huang
The controllability of 3D object generation methods is achieved through textual input. Existing text-to-3D object generation methods focus primarily on generating a single object from a single object description. However, when the input text involves multiple objects, these methods often struggle to place the objects at the desired positions. To address the issue of controllability in multi-object generation, this paper introduces COMOGen, a COntrollable text-to-3D Multi-Object Generation framework. COMOGen enables the simultaneous generation of multiple 3D objects by distilling layout and multiview prior knowledge. The framework consists of three modules: the layout control module, the multiview consistency control module, and the 3D content enhancement module. Moreover, to integrate these three modules into a single framework, we propose Layout Multiview Score Distillation, which unifies the two sources of prior knowledge and further enhances the diversity and quality of the generated 3D content. Comprehensive experiments demonstrate the effectiveness of our approach compared to state-of-the-art methods. This represents a significant step toward more controlled and versatile text-based 3D content generation.
{"title":"Controllable text-to-3D multi-object generation via integrating layout and multiview patterns","authors":"Shaorong Sun , Shuchao Pang , Yazhou Yao , Xiaoshui Huang","doi":"10.1016/j.cag.2025.104353","DOIUrl":"10.1016/j.cag.2025.104353","url":null,"abstract":"<div><div>The controllability of 3D object generation methods is achieved through textual input. Existing text-to-3D object generation methods focus primarily on generating a single object based on a single object description. However, these methods often face challenges in producing results that accurately correspond to our desired positions when the input text involves multiple objects. To address the issue of controllability in the generation of multiple objects, this paper introduces COMOGen, a <strong>CO</strong>ntrollable text-to-3D <strong>M</strong>ulti-<strong>O</strong>bject <strong>Gen</strong>eration framework. COMOGen enables the simultaneous generation of multiple 3D objects by distilling layout and multiview prior knowledge. The framework consists of three modules: the layout control module, the multiview consistency control module, and the 3D content enhancement module. Moreover, to integrate these three modules as an integral framework, we propose Layout Multiview Score Distillation, which unifies two prior knowledge and further enhances the diversity and quality of generated 3D content. Comprehensive experiments demonstrate the effectiveness of our approach compared to state-of-the-art methods. This represents a significant step forward to enable more controlled and versatile text-based 3D content generation.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104353"},"PeriodicalIF":2.8,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-12 | DOI: 10.1016/j.cag.2025.104381
Jialu Xi, Shiguang Liu
With the development of emerging devices (e.g., AR/VR) and video dissemination technologies, egocentric video tasks have received much attention, and understanding user actions in egocentric videos is especially important. Egocentric temporal action segmentation is complicated by unique challenges such as abrupt point-of-view shifts and a limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to accommodate the nuances of egocentric action segmentation and incur significant computational costs. We therefore propose a new framework that integrates focal modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which attends uniformly to all features in the sequence, focal modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing detailed local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, where focal modulation ensures that only relevant global context is integrated based on the content of the current frame, and the aggregated features are then efficiently decoded to produce accurate temporal action boundaries. By using focal modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical temporal action segmentation datasets (50Salads, Breakfast) as well as on additional first-person datasets (GTEA, HOI4D, and FineBio).
{"title":"FocalFormer: Leveraging focal modulation for efficient action segmentation in egocentric videos","authors":"Jialu Xi, Shiguang Liu","doi":"10.1016/j.cag.2025.104381","DOIUrl":"10.1016/j.cag.2025.104381","url":null,"abstract":"<div><div>With the development of various emerging devices (e.g., AR/VR) and video dissemination technologies, self-centered video tasks have received much attention, and it is especially important to understand user actions in self-centered videos, where self-centered temporal action segmentation complicates the task due to its unique challenges such as abrupt point-of-view shifts and limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to effectively accommodate the nuances of egocentric action segmentation and incur significant computational costs. Therefore, we propose a new framework that integrates focus modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which focuses uniformly on all features in the entire sequence, focus modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows for selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing detailed local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, where focus modulation ensures that only relevant global contexts are integrated based on the content of the current frame, ultimately efficiently decoding aggregated features to produce accurate temporal action boundaries. By using focus modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical datasets for temporal segmentation tasks (50salads, breakfast) as well as additional datasets with a first-person perspective (GTEA, HOI4D, and FineBio).</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104381"},"PeriodicalIF":2.8,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145060470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-12 | DOI: 10.1016/j.cag.2025.104414
Jorik Jakober , Matthias Kunz , Robert Kreher , Matteo Pantano , Daniel Braß , Janine Weidling , Christian Hansen , Rüdiger Braun-Dullaeus , Bernhard Preim
Strong procedural skills are essential to perform safe and effective transcatheter aortic valve replacement (TAVR). Traditional training takes place in the operating room (OR) on real patients and requires learning new motor skills, resulting in longer procedure times, increased risk of complications, and greater radiation exposure for patients and medical personnel. Desktop-based simulators in interventional cardiology have shown some validity but lack true depth perception, whereas head-mounted-display-based virtual reality (VR) offers intuitive 3D interaction that enhances training effectiveness and spatial understanding. However, providing realistic and immersive training remains challenging, as both approaches lack tactile feedback. We have developed an augmented virtuality (AV) training system for transfemoral TAVR, combining a catheter tracking device (for translational input) with a simulated virtual OR. The system enables users to manually control a virtual angiography system via hand tracking and to navigate a guidewire through a virtual patient up to the aortic valve using fluoroscopy-like imaging. In addition, we conducted a preliminary user study with 12 participants, assessing cybersickness, usability, workload, sense of presence, and qualitative factors. Preliminary results indicate that the system provides realistic interaction for key procedural steps, making it a suitable learning tool for novices. Remaining limitations include the lack of haptic resistance when operating the angiography system and usability issues related to C-arm control, particularly due to hand-tracking constraints and split attention between interaction and monitoring. Suggestions for improvement include catheter rotation tracking, expanded procedural coverage, and enhanced fluoroscopic image fidelity.
{"title":"Design, development, and evaluation of an immersive augmented virtuality training system for transcatheter aortic valve replacement","authors":"Jorik Jakober , Matthias Kunz , Robert Kreher , Matteo Pantano , Daniel Braß , Janine Weidling , Christian Hansen , Rüdiger Braun-Dullaeus , Bernhard Preim","doi":"10.1016/j.cag.2025.104414","DOIUrl":"10.1016/j.cag.2025.104414","url":null,"abstract":"<div><div>Strong procedural skills are essential to perform safe and effective transcatheter aortic valve replacement (TAVR). Traditional training takes place in the operating room (OR) on real patients and requires learning new motor skills, resulting in longer procedure times, increased risk of complications, and greater radiation exposure for patients and medical personnel. Desktop-based simulators in interventional cardiology have shown some validity but lack true depth perception, whereas head-mounted display based Virtual Reality (VR) offers intuitive 3D interaction that enhances training effectiveness and spatial understanding. However, providing realistic and immersive training remains a challenging task as both lack tactile feedback. We have developed an augmented virtuality (AV) training system for transfemoral TAVR, combining a catheter tracking device (for translational input) with a simulated virtual OR. The system enables users to manually control a virtual angiography system via hand tracking and navigate a guidewire through a virtual patient up to the aortic valve using fluoroscopic-like imaging. In addition, we conducted a preliminary user study with 12 participants, assessing cybersickness, usability, workload, sense of presence, and qualitative factors. Preliminary results indicate that the system provides realistic interaction for key procedural steps, making it a suitable learning tool for novices. Limitations in angiography system operation include the lack of haptic resistance and usability limitations related to C-arm control, particularly due to hand tracking constraints and split attention between interaction and monitoring. Suggestions for improvement include catheter rotation tracking, expanded procedural coverage, and enhanced fluoroscopic image fidelity.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104414"},"PeriodicalIF":2.8,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145160105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.
{"title":"Navigating large-pose challenge for high-fidelity face reenactment with video diffusion model","authors":"Mingtao Guo , Guanyu Xing , Yanci Zhang , Yanli Liu","doi":"10.1016/j.cag.2025.104423","DOIUrl":"10.1016/j.cag.2025.104423","url":null,"abstract":"<div><div>Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104423"},"PeriodicalIF":2.8,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09 | DOI: 10.1016/j.cag.2025.104424
Chiara Romanengo , Tommaso Sorgente , Daniela Cabiddu , Matteo Ghellere , Lorenzo Belussi , Ludovico Danza , Michela Mortara
Aerial LiDAR (and photogrammetric) surveys are becoming common practice in land and urban management, and aerial point clouds (or the reconstructed surfaces) are increasingly used as digital representations of natural and built structures for the monitoring and simulation of urban processes or the generation of what-if scenarios. The geometric analysis of a “digital twin” of the built environment can provide quantitative evidence to support urban policies, such as planning of interventions and incentives for the transition to renewable energy. In this work, we present a geometry-based approach to efficiently and accurately estimate the photovoltaic (PV) energy produced by urban roofs. The method combines a primitive fitting technique for detecting and characterizing building roof components from aerial LiDAR data with an optimization strategy to determine the maximum number and optimal placement of PV modules on each roof surface. The energy production of the PV system on each building over a specified time period (e.g., one year) is estimated from the solar radiation received by each PV module, the shadows cast by neighboring buildings or trees, and efficiency requirements. The strength of the proposed approach is its ability to combine computational techniques, domain expertise, and heterogeneous data into a logical and automated workflow, whose effectiveness is evaluated and tested on large-scale, real-world urban areas with complex morphology in Italy.
{"title":"Geometry-aware estimation of photovoltaic energy from aerial LiDAR point clouds","authors":"Chiara Romanengo , Tommaso Sorgente , Daniela Cabiddu , Matteo Ghellere , Lorenzo Belussi , Ludovico Danza , Michela Mortara","doi":"10.1016/j.cag.2025.104424","DOIUrl":"10.1016/j.cag.2025.104424","url":null,"abstract":"<div><div>Aerial LiDAR (and photogrammetric) surveys are becoming a common practice in land and urban management, and aerial point clouds (or the reconstructed surfaces) are increasingly used as digital representations of natural and built structures for the monitoring and simulation of urban processes or the generation of what-if scenarios. The geometric analysis of a “digital twin” of the built environment can contribute to provide quantitative evidence to support urban policies like planning of interventions and incentives for the transition to renewable energy. In this work, we present a geometry-based approach to efficiently and accurately estimate the photovoltaic (PV) energy produced by urban roofs. The method combines a primitive fitting technique for detecting and characterizing building roof components from aerial LiDAR data with an optimization strategy to determine the maximum number and optimal placement of PV modules on each roof surface. The energy production of the PV system on each building over a specified time period (e.g., one year) is estimated based on the solar radiation received by each PV module and the shadow projected by neighboring buildings or trees and efficiency requirements. The strength of the proposed approach is its ability to combine computational techniques, domain expertise, and heterogeneous data into a logical and automated workflow, whose effectiveness is evaluated and tested on a large-scale, real-world urban areas with complex morphology in Italy.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104424"},"PeriodicalIF":2.8,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1016/j.cag.2025.104419
Jiayun Hu , Shiqi Jiang , Haiwen Huang , Shuqi Liu , Yun Wang , Changbo Wang , Chenhui Li
Color is powerful in communicating information in visualizations. However, crafting palettes that improve readability and capture readers’ attention often demands substantial effort, even for seasoned designers. Existing text-based palette generation results in limited and predictable combinations, and finding suitable reference images to extract colors without a clear idea is both tedious and frustrating. In this work, we present Prompt2Color, a novel framework for generating color palettes using prompts. To simplify the process of finding relevant images, we first adopt a concretization approach to visualize the prompts. Furthermore, we introduce an attention-based method for color extraction, which allows for the mining of the visual representations of the prompts. Finally, we utilize a knowledge base to refine the palette and generate the background color to meet aesthetic and design requirements. Evaluations, including quantitative metrics and user experiments, demonstrate the effectiveness of our method.
{"title":"Prompt2Color: A prompt-based framework for image-derived color generation and visualization optimization","authors":"Jiayun Hu , Shiqi Jiang , Haiwen Huang , Shuqi Liu , Yun Wang , Changbo Wang , Chenhui Li","doi":"10.1016/j.cag.2025.104419","DOIUrl":"10.1016/j.cag.2025.104419","url":null,"abstract":"<div><div>Color is powerful in communicating information in visualizations. However, crafting palettes that improve readability and capture readers’ attention often demands substantial effort, even for seasoned designers. Existing text-based palette generation results in limited and predictable combinations, and finding suitable reference images to extract colors without a clear idea is both tedious and frustrating. In this work, we present Prompt2Color, a novel framework for generating color palettes using prompts. To simplify the process of finding relevant images, we first adopt a concretization approach to visualize the prompts. Furthermore, we introduce an attention-based method for color extraction, which allows for the mining of the visual representations of the prompts. Finally, we utilize a knowledge base to refine the palette and generate the background color to meet aesthetic and design requirements. Evaluations, including quantitative metrics and user experiments, demonstrate the effectiveness of our method.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104419"},"PeriodicalIF":2.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1016/j.cag.2025.104358
Kazi Injamamul Haque, Sichun Wu, Zerrin Yumak
Audio-driven 3D facial animation synthesis has been an active field of research, attracting attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animation and emotional expression at the same time. In addition, the majority of models are deterministic, meaning that, given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we present ProbTalk3D-X, which extends our prior work ProbTalk3D, a two-stage VQ-VAE-based non-deterministic model, by additionally incorporating prosody features for improved facial accuracy, using the emotionally rich facial animation dataset 3DMEAD. Further, we present a comprehensive comparison of non-deterministic emotion-controllable models (including newly extended experimental models) leveraging VQ-VAE, VAE, and diffusion techniques. We provide an extensive comparative analysis of the experimental models against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. Our evaluation demonstrates that ProbTalk3D-X and the original ProbTalk3D achieve superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for visual quality judgment. The entire codebase, including the extended models, is publicly available.
{"title":"ProbTalk3D-X: Prosody enhanced non-deterministic emotion controllable speech-driven 3D facial animation synthesis","authors":"Kazi Injamamul Haque, Sichun Wu, Zerrin Yumak","doi":"10.1016/j.cag.2025.104358","DOIUrl":"10.1016/j.cag.2025.104358","url":null,"abstract":"<div><div>Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we present ProbTalk3D-X by extending a prior work ProbTalk3D- a two staged VQ-VAE based non-deterministic model, by additionally incorporating prosody features for improved facial accuracy using an emotionally rich facial animation dataset, 3DMEAD. Further, we present a comprehensive comparison of non-deterministic emotion controllable models (including new extended experimental models) leveraging VQ-VAE, VAE and diffusion techniques. We provide an extensive comparative analysis of the experimental models against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. Our evaluation demonstrates that ProbTalk3D-X and original ProbTalk3D achieve superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for visual quality judgment. The entire codebase including the extended models is publicly available.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104358"},"PeriodicalIF":2.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1016/j.cag.2025.104404
Zipei Chen , Yumeng Li , Zhong Ren, Yao-Xiang Ding, Kun Zhou
Monocular motion estimation in real scenes is challenging in the presence of noisy and possibly occluded detections. A recent method introduces a diffusion-based generative motion prior, which treats input detections as noisy partial evidence and generates motion through denoising. This improves robustness and motion quality, yet it does not ensure that the denoised motion stays close to the visual observation, which often causes misalignment. In this work, we propose to reconcile appearance modeling with the generative motion prior, enabling appearance to play the crucial role of providing reliable, noise-free visual evidence for accurate visual alignment. The appearance is modeled by the radiance of both the scene and the human for joint differentiable rendering. To achieve this with monocular RGB input, without masks or depth, we propose a semantic-perturbed mode estimation method that faithfully estimates static scene radiance from dynamic input with complex occlusion relationships, and a polyline depth calibration method that leverages knowledge from a depth estimation model to recover the missing depth information. Meanwhile, to leverage knowledge from the motion prior and reconcile it with the appearance guidance during optimization, we also propose an occlusion-aware gradient merging strategy. Experimental results demonstrate that our method achieves better-aligned tracking results while maintaining competitive motion quality. Our code is released at https://github.com/Zipei-Chen/Appearance-as-Reliable-Evidence-implementation.
{"title":"Appearance as reliable evidence: Reconciling appearance and generative priors for monocular motion estimation","authors":"Zipei Chen , Yumeng Li , Zhong Ren, Yao-Xiang Ding, Kun Zhou","doi":"10.1016/j.cag.2025.104404","DOIUrl":"10.1016/j.cag.2025.104404","url":null,"abstract":"<div><div>Monocular motion estimation in real scenes is challenging with the presence of noisy and possibly occluded detections. The recent method proposes to introduce a diffusion-based generative motion prior, which treats input detections as noisy partial evidence and generates motion through denoising. This advances robustness and motion quality, yet regardless of whether the denoised motion is close to visual observation, which often causes misalignment. In this work, we propose to reconcile model appearance and motion prior, which enables appearance to play the crucial role of providing reliable noise-free visual evidence for accurate visual alignment. The appearance is modeled by the radiance of both scene and human for joint differentiable rendering. To achieve this with monocular RGB input without mask and depth, we propose a semantic-perturbed mode estimation method to faithfully estimate static scene radiance from dynamic input with complex occlusion relationships, and a polyline depth calibration method to leverage knowledge from depth estimation model to recover the missing depth information. Meanwhile, to leverage knowledge from motion prior and reconcile it with the appearance guidance during optimization, we also propose an occlusion-aware gradient merging strategy. Experimental results demonstrate that our method achieves better-aligned tracking results while maintaining competitive motion quality. Our code is released at <span><span>https://github.com/Zipei-Chen/Appearance-as-Reliable-Evidence-implementation</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104404"},"PeriodicalIF":2.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1016/j.cag.2025.104422
Guodong Sun , Dingjie Liu , Zeyu Yang , Shaoran An , Yang Zhang
Traditional 3D reconstruction methods for industrial components present significant limitations. Structured light and laser scanning require costly equipment, complex procedures, and remain sensitive to scan completeness and occlusions. These constraints restrict their application in settings with budget and expertise limitations. Deep learning approaches reduce hardware requirements but fail to accurately reconstruct complex industrial surfaces with real-world data. Industrial components feature intricate geometries and surface irregularities that challenge current deep learning techniques. These methods also demand substantial computational resources, limiting industrial implementation. This paper presents a 3D reconstruction and measurement system based on Gaussian Splatting. The method incorporates adaptive modifications to address the unique surface characteristics of industrial components, ensuring both accuracy and efficiency. To resolve scale and pose discrepancies between the reconstructed Gaussian model and ground truth, a robust scaling and registration pipeline has been developed. This pipeline enables precise evaluation of reconstruction quality and measurement accuracy. Comprehensive experimental evaluations demonstrate that our approach achieves high-precision reconstruction, with an average Chamfer Distance of 2.24 and a mean F1 Score of 0.19, surpassing existing methods. Additionally, the average scale error is 2.41%. The proposed system enables reliable dimensional measurements using only consumer-grade cameras, significantly reducing equipment costs and simplifying operation, thereby improving the accessibility of 3D reconstruction in industrial applications. A publicly available industrial component dataset has been constructed to serve as a benchmark for future research. The dataset and code are available at https://github.com/ldj0o/IndustrialComponentGS.
{"title":"3D reconstruction and precision evaluation of industrial components via Gaussian Splatting","authors":"Guodong Sun , Dingjie Liu , Zeyu Yang , Shaoran An , Yang Zhang","doi":"10.1016/j.cag.2025.104422","DOIUrl":"10.1016/j.cag.2025.104422","url":null,"abstract":"<div><div>Traditional 3D reconstruction methods for industrial components present significant limitations. Structured light and laser scanning require costly equipment, complex procedures, and remain sensitive to scan completeness and occlusions. These constraints restrict their application in settings with budget and expertise limitations. Deep learning approaches reduce hardware requirements but fail to accurately reconstruct complex industrial surfaces with real-world data. Industrial components feature intricate geometries and surface irregularities that challenge current deep learning techniques. These methods also demand substantial computational resources, limiting industrial implementation. This paper presents a 3D reconstruction and measurement system based on Gaussian Splatting. The method incorporates adaptive modifications to address the unique surface characteristics of industrial components, ensuring both accuracy and efficiency. To resolve scale and pose discrepancies between the reconstructed Gaussian model and ground truth, a robust scaling and registration pipeline has been developed. This pipeline enables precise evaluation of reconstruction quality and measurement accuracy. Comprehensive experimental evaluations demonstrate that our approach achieves high-precision reconstruction, with an average Chamfer Distance of 2.24 and a mean F1 Score of 0.19, surpassing existing methods. Additionally, the average scale error is 2.41%. The proposed system enables reliable dimensional measurements using only consumer-grade cameras, significantly reducing equipment costs and simplifying operation, thereby improving the accessibility of 3D reconstruction in industrial applications. A publicly available industrial component dataset has been constructed to serve as a benchmark for future research. The dataset and code are available at <span><span>https://github.com/ldj0o/IndustrialComponentGS</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104422"},"PeriodicalIF":2.8,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}