MESA-Net: Multi-Scale Enhanced Spatial Attention Network for medical image segmentation
Pub Date: 2025-11-13 | DOI: 10.1016/j.cag.2025.104488
Demin Liu, Zhou Yang, Hua Wang, Huiyu Li, Fan Zhang
Medical image segmentation plays a critical role in enabling precise visualization and interaction within Extended Reality (XR) environments, which are increasingly used in surgical planning, image-guided interventions, and medical training. Transformer-based architectures have recently become a prominent approach for medical image segmentation due to their ability to capture long-range dependencies through self-attention mechanisms. However, these models often struggle to effectively extract local contextual information that is essential for accurate boundary delineation and fine-grained structure preservation. To address this issue, we propose Multi-Scale Enhanced Spatial Attention Network (MESA-Net), a novel architecture that synergistically combines global attention modeling with localized feature extraction. The network adopts an encoder–decoder structure, where the encoder leverages a pre-trained pyramid vision transformer v2 (PVTv2) to generate rich hierarchical representations. We design a position-aware spatial attention module and a multi-dimensional feature refinement module, which are integrated into the decoder to strengthen local context modeling and refine segmentation outputs. Comprehensive experiments on the Synapse and ACDC datasets demonstrate that MESA-Net achieves state-of-the-art performance, particularly in preserving fine anatomical structures. These improvements in segmentation quality provide a solid foundation for future XR applications, such as real-time interactive visualization and precise 3D reconstruction in clinical scenarios. Our method’s code will be released at: https://github.com/bukeyijuanjuan/MESA-Net.
{"title":"MESA-Net: Multi-Scale Enhanced Spatial Attention Network for medical image segmentation","authors":"Demin Liu , Zhou Yang , Hua Wang , Huiyu Li , Fan Zhang","doi":"10.1016/j.cag.2025.104488","DOIUrl":"10.1016/j.cag.2025.104488","url":null,"abstract":"<div><div>Medical image segmentation plays a critical role in enabling precise visualization and interaction within Extended Reality (XR) environments, which are increasingly used in surgical planning, image-guided interventions, and medical training. Transformer-based architectures have recently become a prominent approach for medical image segmentation due to their ability to capture long-range dependencies through self-attention mechanisms. However, these models often struggle to effectively extract local contextual information that is essential for accurate boundary delineation and fine-grained structure preservation. To address this issue, we propose Multi-Scale Enhanced Spatial Attention Network (MESA-Net), a novel architecture that synergistically combines global attention modeling with localized feature extraction. The network adopts an encoder–decoder structure, where the encoder leverages a pre-trained pyramid vision transformer v2 (PVTv2) to generate rich hierarchical representations. We design a position-aware spatial attention module and a multi-dimensional feature refinement module, which are integrated into the decoder to strengthen local context modeling and refine segmentation outputs. Comprehensive experiments on the Synapse and ACDC datasets demonstrate that MESA-Net achieves state-of-the-art performance, particularly in preserving fine anatomical structures. These improvements in segmentation quality provide a solid foundation for future XR applications, such as real-time interactive visualization and precise 3D reconstruction in clinical scenarios. Our method’s code will be released at: <span><span>https://github.com/bukeyijuanjuan/MESA-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104488"},"PeriodicalIF":2.8,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Render, Encode, Plan: A simple pipeline for hybrid RL-DL learning inside Unreal Engine
Pub Date: 2025-11-13 | DOI: 10.1016/j.cag.2025.104467
Daniele Della Pietra, Nicola Garau
Learning is an iterative process that requires multiple forms of interaction with the environment. During learning, we experience the world through the repetition of observations and actions, gaining an insight into which combination of these leads to the best results, according to our goals. The same paradigm has been applied to traditional reinforcement learning (RL) over the years, with impressive results in 3D navigation and planning. On the other hand, the computer vision community has been focusing mostly on vision-related tasks (e.g. classification, segmentation, depth estimation) using deep learning (DL). We present REP: Render, Encode, Plan, a unified framework to train embodied agents of different kinds (humanoids, vehicles, and drones) inside Unreal Engine, showing how a combination of RL and DL can help to shape intelligent agents that can better sense the surrounding environment. The main advantage of our method is the combination of different sensory modalities, including game state observations and vision features, that allow the agents to share a similar structure in their observations and rewards, while defining separate rewards based on their goals. We demonstrate impressive generalization capabilities on large-scale realistic 3D environments and on multiple dynamically changing scenarios, with different goals and rewards. All code, complete experiments, and environments will be available at https://mmlab-cv.github.io/REP/.
{"title":"Render, Encode, Plan: A simple pipeline for hybrid RL-DL learning inside Unreal Engine","authors":"Daniele Della Pietra , Nicola Garau","doi":"10.1016/j.cag.2025.104467","DOIUrl":"10.1016/j.cag.2025.104467","url":null,"abstract":"<div><div>Learning is an iterative process that requires multiple forms of interaction with the environment. During learning, we experience the world through the repetition of observations and actions, gaining an insight into which combination of these leads to the best results, according to our goals. The same paradigm has been applied to traditional reinforcement learning (RL) over the years, with impressive results in 3D navigation and planning. On the other hand, the computer vision community has been focusing mostly on vision-related tasks (e.g. classification, segmentation, depth estimation) using deep learning (DL). We present <strong>REP: Render, Encode, Plan</strong>, a unified framework to train embodied agents of different kinds (humanoids, vehicles, and drones) inside Unreal Engine, showing how a combination of RL and DL can help to shape intelligent agents that can better sense the surrounding environment. The main advantage of our method is the combination of different sensory modalities, including game state observations and vision features, that allow the agents to share a similar structure in their observations and rewards, while defining separate rewards based on their goals. We demonstrate impressive generalization capabilities on large-scale realistic 3D environments and on multiple dynamically changing scenarios, with different goals and rewards. All code, complete experiments, and environments will be available at <span><span>https://mmlab-cv.github.io/REP/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104467"},"PeriodicalIF":2.8,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training-free geometry-aware control for localized image viewpoint editing
Pub Date: 2025-11-12 | DOI: 10.1016/j.cag.2025.104485
Lingfang Wang, Meiqing Wang, Hang Cheng, Jingyue Wang, Fei Chen
The great success of diffusion models in the text-to-image field has driven increasing demand for fine-grained local image editing. One such task is changing the viewpoint of objects to given positions in accordance with 3D geometric principles. Keeping the surrounding region unchanged and maintaining structural and semantic consistency while editing the designated objects is a challenging yet widely applicable task. However, existing methods often fail to maintain correct geometric structure and editing efficiency simultaneously. To this end, we analyze, from the perspective of 3D camera projection, how the geometric structure of an image changes with the viewpoint, and propose a geometry-aware local viewpoint editing approach that requires neither 3D reconstruction nor model training and performs editing solely at a single timestep in the latent space of diffusion models. Central to our approach is constructing latent-space location mappings across different viewpoints by integrating multi-view geometry with absolute depth information. To address assignment conflicts and missing latent features while enhancing detail fidelity, we design an occlusion reasoning mechanism and a foreground-background aware bilateral interpolation strategy. Additionally, a consistency-preserving strategy is introduced to enhance alignment with the original image. Extensive experiments on image datasets demonstrate the overall advantages of our approach in structural consistency and runtime efficiency.
{"title":"Training-free geometry-aware control for localized image viewpoint editing","authors":"Lingfang Wang , Meiqing Wang , Hang Cheng , Jingyue Wang , Fei Chen","doi":"10.1016/j.cag.2025.104485","DOIUrl":"10.1016/j.cag.2025.104485","url":null,"abstract":"<div><div>The great success of diffusion models in the text-to-image field has driven the increasing demand for fine-grained local image editing. One of which is changing the viewpoint of objects to given positions in accordance with 3D geometric principles. How to keep the surrounding region unchanged and maintain structural and semantic consistency when editing the designated objects is a challenging yet widely applicable task. However, existing methods often fail to maintain correct geometric structure and editing efficiency simultaneously. To this end, we explore the geometric structure changes of images when the viewpoint changes from the perspective of 3D camera projection and propose a geometry-aware local viewpoint editing approach that requires neither 3D reconstruction nor model training, and performs editing solely at a single timestep in the latent space of diffusion models. Central to our approach is constructing latent-space location mappings across different viewpoints by integrating multi-view geometry theory with the absolute depth information. To address assignment conflicts and latent feature missing problems while enhancing detail fidelity, we design an occlusion reasoning mechanism and a foreground-background aware bilateral interpolation strategy. Additionally, a consistency-preserving strategy is introduced to enhance alignment with the original image. Extensive experiments on image datasets demonstrate the overall advantages of our approach in structural consistency and runtime efficiency.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104485"},"PeriodicalIF":2.8,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SmartPoints: Enhanced local feature extraction and neighborhood diffusion network for 3D point cloud semantic segmentation
Pub Date: 2025-11-11 | DOI: 10.1016/j.cag.2025.104486
Ye Chen, Jian Lu, Jie Zhao, Xiaogai Chen, Kaibing Zhang
In recent years, transformer-based models have demonstrated strong performance in global information extraction. However, in 3D point cloud segmentation, such models still fall short when it comes to capturing local features and accurately identifying geometric and topological relationships. To address the resulting insufficiency in local feature extraction, we propose an enhanced local feature extraction and neighborhood diffusion network for 3D point cloud semantic segmentation (SmartPoints). First, our method aggregates local features from the input point set using a hierarchical feature fusion module (HFF), which enhances information interaction and dependency between different local regions. Next, the dual local topological structure perception module (DLTP) constructs two local topologies using positional and semantic information, respectively. An adaptive dynamic kernel is then designed to capture the mapping between the two local topologies, enhancing local feature representation. To address the challenge of unclear local neighborhood edge distinctions, which often lead to segmentation errors, we design a local neighborhood diffusion module (LND). This module achieves precise edge segmentation by enhancing target region features and suppressing non-target region features. Extensive experiments on benchmark datasets such as S3DIS, ScanNetV2 and SemanticKITTI demonstrate the superior segmentation performance of the proposed SmartPoints.
{"title":"SmartPoints: Enhanced local feature extraction and neighborhood diffusion network for 3D point cloud semantic segmentation","authors":"Ye Chen, Jian Lu, Jie Zhao, Xiaogai Chen, Kaibing Zhang","doi":"10.1016/j.cag.2025.104486","DOIUrl":"10.1016/j.cag.2025.104486","url":null,"abstract":"<div><div>In recent years, transformer-based models have demonstrated strong performance in global information extraction. However, in 3D point cloud segmentation, such models still fall short when it comes to capturing local features and accurately identifying geometric and topological relationships. To address the resulting insufficiency in local feature extraction, we propose an enhanced local feature extraction and neighborhood diffusion network for 3D point cloud semantic segmentation (SmartPoints). First, our method aggregates local features from the input point set using a hierarchical feature fusion module (HFF), which enhances information interaction and dependency between different local regions. Next, the dual local topological structure perception module (DLTP) constructs two local topologies using positional and semantic information, respectively. An adaptive dynamic kernel is then designed to capture the mapping between the two local topologies, enhancing local feature representation. To address the challenge of unclear local neighborhood edge distinctions, which often lead to segmentation errors, we design a local neighborhood diffusion module (LND). This module achieves precise edge segmentation by enhancing target region features and suppressing non-target region features. Extensive experiments on benchmark datasets such as S3DIS, ScanNetV2 and SemanticKITTI demonstrate the superior segmentation performance of the proposed SmartPoints.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104486"},"PeriodicalIF":2.8,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parametric model fitting for textured and animatable 3D avatar from a single frontal image of a clothed human
Pub Date: 2025-11-11 | DOI: 10.1016/j.cag.2025.104478
Fares Mallek, Carlos Vázquez, Eric Paquette
In this paper, we tackle the challenge of three-dimensional estimation of expressive, animatable, and textured human avatars from a single frontal image. Leveraging a Skinned Multi-Person Linear (SMPL) parametric body model, we adjust the model parameters to faithfully reflect the shape and pose of the individual, relying on the mesh generated by a Pixel-aligned Implicit Function (PIFu) model. To robustly infer the SMPL parameters, we deploy a multi-step optimization process. Initially, we recover the position of 2D joints using an existing pose estimation tool. Subsequently, we utilize the 3D PIFu mesh together with the 2D pose to estimate the 3D position of joints. In the subsequent step, we adapt the body’s parametric model to the 3D joints through rigid alignment, optimizing for global translation and rotation. This step provides a robust initialization for further refinement of shape and pose parameters. The next step involves optimizing the pose and the first component of the SMPL shape parameters while imposing constraints to enhance model robustness. We then refine the SMPL model pose and shape parameters by adding two new registration loss terms to the optimization cost function: a point-to-surface distance and a Chamfer distance. Finally, we introduce a refinement process utilizing a deformation vector field applied to the SMPL mesh, enabling more faithful modeling of tight to loose clothing geometry. Like most other works, we optimize based on images of people wearing shoes, resulting in artifacts in the toe region of SMPL. We thus introduce a new shoe-like mesh topology which greatly improves the quality of the reconstructed feet. A notable advantage of our approach is the ability to generate detailed avatars with fewer vertices compared to previous research, enhancing computational efficiency while maintaining high fidelity. We also demonstrate how to gain even more details, while maintaining the advantages of SMPL. To complete our model, we design a texture extraction and completion approach. Our entirely automated approach was evaluated against recognized benchmarks, X-Avatar and PeopleSnapshot, showcasing competitive performance against state-of-the-art methods. This approach contributes to advancing 3D modeling techniques, particularly in the realms of interactive applications, animation, and video games. We will make our code and our improved SMPL mesh topology available to the community: https://github.com/ETS-BodyModeling/ImplicitParametricAvatar.
{"title":"Parametric model fitting for textured and animatable 3D avatar from a single frontal image of a clothed human","authors":"Fares Mallek, Carlos Vázquez, Eric Paquette","doi":"10.1016/j.cag.2025.104478","DOIUrl":"10.1016/j.cag.2025.104478","url":null,"abstract":"<div><div>In this paper, we tackle the challenge of three-dimensional estimation of expressive, animatable, and textured human avatars from a single frontal image. Leveraging a Skinned Multi-Person Linear (SMPL) parametric body model, we adjust the model parameters to faithfully reflect the shape and pose of the individual, relying on the mesh generated by a Pixel-aligned Implicit Function (PIFu) model. To robustly infer the SMPL parameters, we deploy a multi-step optimization process. Initially, we recover the position of 2D joints using an existing pose estimation tool. Subsequently, we utilize the 3D PIFu mesh together with the 2D pose to estimate the 3D position of joints. In the subsequent step, we adapt the body’s parametric model to the 3D joints through rigid alignment, optimizing for global translation and rotation. This step provides a robust initialization for further refinement of shape and pose parameters. The next step involves optimizing the pose and the first component of the SMPL shape parameters while imposing constraints to enhance model robustness. We then refine the SMPL model pose and shape parameters by adding two new registration loss terms to the optimization cost function: a point-to-surface distance and a Chamfer distance. Finally, we introduce a refinement process utilizing a deformation vector field applied to the SMPL mesh, enabling more faithful modeling of tight to loose clothing geometry. As most other works, we optimize based on images of people wearing shoes, resulting in artifacts in the toes region of SMPL. We thus introduce a new shoe-like mesh topology which greatly improves the quality of the reconstructed feet. A notable advantage of our approach is the ability to generate detailed avatars with fewer vertices compared to previous research, enhancing computational efficiency while maintaining high fidelity. We also demonstrate how to gain even more details, while maintaining the advantages of SMPL. To complete our model, we design a texture extraction and completion approach. Our entirely automated approach was evaluated against recognized benchmarks, X-Avatar and PeopleSnapshot, showcasing competitive performance against state-of-the-art methods. This approach contributes to advancing 3D modeling techniques, particularly in the realms of interactive applications, animation, and video games. We will make our code and our improved SMPL mesh topology available to the community: <span><span>https://github.com/ETS-BodyModeling/ImplicitParametricAvatar</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104478"},"PeriodicalIF":2.8,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MonoNeRF-DDP: Neural radiance fields from monocular endoscopic images with dense depth priors
Pub Date: 2025-11-11 | DOI: 10.1016/j.cag.2025.104487
Jinhua Liu, Dongjin Huang, Yongsheng Shi, Jiantao Qu
Synthesizing novel views from monocular endoscopic images is challenging due to sparse input views, occlusion of invalid regions, and soft tissue deformation. To tackle these challenges, we propose neural radiance fields from monocular endoscopic images with dense depth priors, called MonoNeRF-DDP. The algorithm consists of two parts: preprocessing and normative depth-assisted reconstruction. In the preprocessing part, we use labelme to obtain mask images for invalid regions in endoscopy images, preventing their reconstruction. Then, to address the view sparsity problem, we fine-tune a monocular depth estimation network to predict dense depth maps, enabling the recovery of scene depth information from sparse views during the neural radiance fields optimization process. In the normative depth-assisted reconstruction, to deal with soft tissue deformation and inaccurate depth information, we adopt neural radiance fields for dynamic scenes that take mask images and dense depth maps as additional inputs, and we utilize the proposed adaptive loss function to achieve self-supervised training. Experimental results show that MonoNeRF-DDP achieves better average scores than competing algorithms on the real monocular endoscopic image dataset GastroSynth. MonoNeRF-DDP can reconstruct structurally accurate shapes, fine details, and highly realistic textures from only about 15 input images. Furthermore, a study with 14 participants with medical backgrounds indicates that MonoNeRF-DDP allows more accurate observation of the details of disease sites and supports more reliable preoperative diagnoses.
{"title":"MonoNeRF-DDP: Neural radiance fields from monocular endoscopic images with dense depth priors","authors":"Jinhua Liu , Dongjin Huang , Yongsheng Shi , Jiantao Qu","doi":"10.1016/j.cag.2025.104487","DOIUrl":"10.1016/j.cag.2025.104487","url":null,"abstract":"<div><div>Synthesizing novel views from monocular endoscopic images is challenging due to sparse input views, occlusion of invalid regions, and soft tissue deformation. To tackle these challenges, we propose the neural radiance fields from monocular endoscopic images with dense depth priors, called MonoNeRF-DDP. The algorithm consists of two parts: preprocessing and normative depth-assisted reconstruction. In the preprocessing part, we use labelme to obtain mask images for invalid regions in endoscopy images, preventing their reconstruction. Then, to address the view sparsity problem, we fine-tuned a monocular depth estimation network to predict dense depth maps, enabling the recovery of scene depth information from sparse views during the neural radiance fields optimization process. In the normative depth-assisted reconstruction, to deal with the issues of soft tissue deformation and inaccurate depth information, we adopt neural radiance fields for dynamic scenes to take mask images and dense depth maps as additional inputs and utilize the proposed adaptive loss function to achieve self-supervised training. Experimental results show that MonoNeRF-DDP outperforms the best average values of competing algorithms across the real monocular endoscopic image dataset GastroSynth. MonoNeRF-DDP can reconstruct structurally accurate shapes, fine details, and highly realistic textures with only about 15 input images. Furthermore, a study of 14 medical-related participants indicates that MonoNeRF-DDP can more accurately observe the details of the disease sites and make more reliable preoperative diagnoses.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104487"},"PeriodicalIF":2.8,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automating visual narratives: Learning cinematic camera perspectives from 3D human interaction
Pub Date: 2025-11-10 | DOI: 10.1016/j.cag.2025.104484
Boyuan Cheng, Shang Ni, Jian Jun Zhang, Xiaosong Yang
Cinematic camera control is essential for guiding audience attention and conveying narrative intent, yet current data-driven methods largely rely on predefined visual datasets and handcrafted rules, limiting generalization and creativity. This paper introduces a novel diffusion-based framework that generates camera trajectories directly from two-character 3D motion sequences, eliminating the need for paired video–camera annotations. The approach leverages Toric features to encode spatial relations between characters and conditions the diffusion process through a dual-stream motion encoder and interaction module, enabling the camera to adapt dynamically to evolving character interactions. A new dataset linking character motion with camera parameters is constructed to train and evaluate the model. Experiments demonstrate that our method outperforms strong baselines in both quantitative metrics and perceptual quality, producing camera motions that are smooth, temporally coherent, and compositionally consistent with cinematic conventions. This work opens new opportunities for automating virtual cinematography in animation, gaming, and interactive media.
{"title":"Automating visual narratives: Learning cinematic camera perspectives from 3D human interaction","authors":"Boyuan Cheng, Shang Ni, Jian Jun Zhang, Xiaosong Yang","doi":"10.1016/j.cag.2025.104484","DOIUrl":"10.1016/j.cag.2025.104484","url":null,"abstract":"<div><div>Cinematic camera control is essential for guiding audience attention and conveying narrative intent, yet current data-driven methods largely rely on predefined visual datasets and handcrafted rules, limiting generalization and creativity. This paper introduces a novel diffusion-based framework that generates camera trajectories directly from two-character 3D motion sequences, eliminating the need for paired video–camera annotations. The approach leverages Toric features to encode spatial relations between characters and conditions the diffusion process through a dual-stream motion encoder and interaction module, enabling the camera to adapt dynamically to evolving character interactions. A new dataset linking character motion with camera parameters is constructed to train and evaluate the model. Experiments demonstrate that our method outperforms strong baselines in both quantitative metrics and perceptual quality, producing camera motions that are smooth, temporally coherent, and compositionally consistent with cinematic conventions. This work opens new opportunities for automating virtual cinematography in animation, gaming, and interactive media.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104484"},"PeriodicalIF":2.8,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Foreword to special section on expressive media
Pub Date: 2025-11-09 | DOI: 10.1016/j.cag.2025.104483
Chiara Eva Catalano, Amal Dev Parakkat, Marc Christie
{"title":"Foreword to special section on expressive media","authors":"Chiara Eva Catalano , Amal Dev Parakkat , Marc Christie","doi":"10.1016/j.cag.2025.104483","DOIUrl":"10.1016/j.cag.2025.104483","url":null,"abstract":"","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104483"},"PeriodicalIF":2.8,"publicationDate":"2025-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Voice of artifacts: Evaluating user preferences for artifact voice in VR museums
Pub Date: 2025-11-08 | DOI: 10.1016/j.cag.2025.104473
Bingqing Chen, Wenqi Chu, Xubo Yang, Yue Li
Voice is a powerful medium for conveying personality, emotion, and social presence, yet its role in cultural contexts such as virtual museums remains underexplored. While prior research in virtual reality (VR) has focused on ambient soundscapes or system-driven narration, little is known about what kinds of artifact voices users actually prefer, or if customized voices influence their experience. In this study, we designed a virtual museum and examined user perceptions of three types of voices for artifact chatbots, including a neutral synthetic voice (default), a socially relatable voice (familiar), and a user-customized voice with adjustable elements (customized). Through a within-subjects experiment, we measured user experience with established scales and a semi-structured interview. Results showed a strong user preference for the customized voice, which significantly outperformed the other two conditions. These findings suggest that users not only expect artifacts to speak, but also prefer to have control over the voices, which can enhance their experience and engagement. Our findings provide empirical evidence for the importance of voice customization in virtual museums and lay the groundwork for future design of interactive, user-centered sound and vocal experiences in VR environments.
{"title":"Voice of artifacts: Evaluating user preferences for artifact voice in VR museums","authors":"Bingqing Chen , Wenqi Chu , Xubo Yang , Yue Li","doi":"10.1016/j.cag.2025.104473","DOIUrl":"10.1016/j.cag.2025.104473","url":null,"abstract":"<div><div>Voice is a powerful medium for conveying personality, emotion, and social presence, yet its role in cultural contexts such as virtual museums remains underexplored. While prior research in virtual reality (VR) has focused on ambient soundscapes or system-driven narration, little is known about what kinds of artifact voices users actually prefer, or if customized voices influence their experience. In this study, we designed a virtual museum and examined user perceptions of three types of voices for artifact chatbots, including a neutral synthetic voice (<em>default</em>), a socially relatable voice (<em>familiar</em>), and a user-customized voice with adjustable elements (<em>customized</em>). Through a within-subjects experiment, we measured user experience with established scales and a semi-structured interview. Results showed a strong user preference for the <em>customized</em> voice, which significantly outperformed the other two conditions. These findings suggest that users not only expect artifacts to speak, but also prefer to have control over the voices, which can enhance their experience and engagement. Our findings provide empirical evidence for the importance of voice customization in virtual museums and lay the groundwork for future design of interactive, user-centered sound and vocal experiences in VR environments.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104473"},"PeriodicalIF":2.8,"publicationDate":"2025-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detail Enhancement Gaussian Avatar: High-quality head avatars modeling
Pub Date: 2025-11-08 | DOI: 10.1016/j.cag.2025.104482
Zhangjin Huang, Bowei Yin
Modeling animatable head avatars from monocular video is a long-standing and challenging problem. Although recent approaches based on 3D Gaussian Splatting (3DGS) have achieved notable progress, the rendered avatars still exhibit several limitations. First, conventional 3DMM priors lack explicit geometric modeling for the eyes and teeth, leading to missing or suboptimal Gaussian initialization in these regions. Second, the heterogeneous characteristics of different facial subregions cause uniform joint training to under-optimize fine-scale details. Third, typical 3DGS issues such as boundary floaters and rendering artifacts remain unresolved in facial Gaussian representations. To address these challenges, we propose Detail Enhancement Gaussian Avatar (DEGA). (1) We augment Gaussian initialization with explicit eye and teeth regions, filling structural gaps left by standard 3DMM-based setups. (2) We introduce a hierarchical Gaussian representation that refines and decomposes the face into semantically aware subregions, enabling more thorough supervision and balanced optimization across all facial areas. (3) We incorporate a learned confidence attribute to suppress unreliable Gaussians, effectively mitigating boundary artifacts and floater phenomena. Overall, DEGA produces lifelike, dynamically expressive head avatars with high-fidelity geometry and appearance. Experiments on public benchmarks demonstrate that our method consistently outperforms state-of-the-art baselines.
{"title":"Detail Enhancement Gaussian Avatar: High-quality head avatars modeling","authors":"Zhangjin Huang, Bowei Yin","doi":"10.1016/j.cag.2025.104482","DOIUrl":"10.1016/j.cag.2025.104482","url":null,"abstract":"<div><div>Modeling animatable head avatars from monocular video is a long-standing and challenging problem. Although recent approaches based on 3D Gaussian Splatting (3DGS) have achieved notable progress, the rendered avatars still exhibit several limitations. First, conventional 3DMM priors lack explicit geometric modeling for the eyes and teeth, leading to missing or suboptimal Gaussian initialization in these regions. Second, the heterogeneous characteristics of different facial subregions cause uniform joint training to under-optimize fine-scale details. Third, typical 3DGS issues such as boundary floaters and rendering artifacts remain unresolved in facial Gaussian representations. To address these challenges, we propose <strong>Detail Enhancement Gaussian Avatar (DEGA)</strong>. (1) We augment Gaussian initialization with explicit eye and teeth regions, filling structural gaps left by standard 3DMM-based setups. (2) We introduce a hierarchical Gaussian representation that refines and decomposes the face into semantically aware subregions, enabling more thorough supervision and balanced optimization across all facial areas. (3) We incorporate a learned confidence attribute to suppress unreliable Gaussians, effectively mitigating boundary artifacts and floater phenomena. Overall, DEGA produces lifelike, dynamically expressive head avatars with high-fidelity geometry and appearance. Experiments on public benchmarks demonstrate that our method consistently outperforms state-of-the-art baselines.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"133 ","pages":"Article 104482"},"PeriodicalIF":2.8,"publicationDate":"2025-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145519910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}