Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3635035
Yudi Zhang, Yeming Geng, Lei Zhang
Interactive 3D model texture editing expands the opportunities for creating 3D assets, with freehand drawing offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while coarse-grained scribble-based interaction remains underexplored. Furthermore, current approaches often struggle with the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) with image generation models. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of a scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguity about the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive performance for scribble-based texture editing.
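As a rough illustration of the two-stage flow the abstract describes (intent prediction by an MLLM, then local texture extraction from a globally generated image), the Python sketch below simply wires the stages together. All function names and data types are hypothetical placeholders, not an API from the paper.

```python
# Illustrative sketch only: the ScribbleSense paper does not publish this API.
# `mllm_predict_intent`, `generate_global_image`, and `bake_local_texture` are
# hypothetical stand-ins for the MLLM, the image generator, and the
# texture-projection step described in the abstract.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditRequest:
    scribble_mask: object      # user scribble rasterized on the rendered view
    rendered_view: object      # current render of the textured 3D model
    user_hint: str             # optional free-form text accompanying the scribble

def scribble_edit(request: EditRequest,
                  mllm_predict_intent: Callable[[EditRequest], str],
                  generate_global_image: Callable[[str, object], object],
                  bake_local_texture: Callable[[object, object], object]) -> object:
    """Two-stage flow: (1) an MLLM infers the semantic editing intent behind the
    scribble; (2) a globally generated image supplies local texture detail that is
    projected back onto the scribbled region to resolve location ambiguity."""
    intent = mllm_predict_intent(request)                      # e.g. "add a floral pattern to the sleeve"
    global_image = generate_global_image(intent, request.rendered_view)
    return bake_local_texture(global_image, request.scribble_mask)

# Dummy wiring so the sketch executes end-to-end without any real models.
if __name__ == "__main__":
    req = EditRequest(scribble_mask="mask", rendered_view="view", user_hint="")
    edited = scribble_edit(
        req,
        mllm_predict_intent=lambda r: "hypothetical intent",
        generate_global_image=lambda intent, view: "global image",
        bake_local_texture=lambda img, mask: "edited local texture",
    )
    print(edited)
```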
{"title":"ScribbleSense: Generative Scribble-Based Texture Editing With Intent Prediction.","authors":"Yudi Zhang, Yeming Geng, Lei Zhang","doi":"10.1109/TVCG.2025.3635035","DOIUrl":"10.1109/TVCG.2025.3635035","url":null,"abstract":"<p><p>Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"2075-2086"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145575104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3646601
Zhuang Chang, Dominik O W Hirschberg, Kunal Gupta, Mehak Sharma, Kangsoo Kim, Huidong Bai, Li Shao, Mark Billinghurst
Mixed Reality Agents (MiRAs) have been extensively studied as a means of enhancing virtual-physical interactions, leveraging their ability to exist in both virtual and physical environments. However, little research has focused on enhancing perceived empathy in MiRAs, despite its potential for agent-assisted therapy, education, and training. To fill this gap, we investigate the impact of an Empathic Mixed Reality Agent (EMiRA) that adapts to users' physiological states and to physical events in a shooting game. We found that this adaptation enhanced users' social perceptions of the agent, including social presence, social connectedness, and perceived empathy. Physiological adaptation increased paternalism and reduced user dominance, while physical adaptation had no such effect. We discuss these findings and provide design implications for future EMiRAs.
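To make the idea of context-aware adaptation more concrete, here is a toy Python sketch of how physiological signals and physical game events could be mapped to an empathic agent response. The signal names, thresholds, and responses are invented for illustration and are not the authors' implementation.

```python
# Toy illustration only: the paper does not specify its adaptation logic.
# Thresholds, signal names, and agent responses here are hypothetical.
from typing import Optional

def adapt_agent(heart_rate_bpm: float, baseline_bpm: float,
                physical_event: Optional[str]) -> dict:
    """Map the user's physiological state and in-game physical events to an
    empathic response, in the spirit of context-aware adaptation."""
    arousal = (heart_rate_bpm - baseline_bpm) / max(baseline_bpm, 1.0)
    response = {"tone": "neutral", "utterance": "", "gesture": None}

    if physical_event == "missed_shot":
        response.update(tone="encouraging",
                        utterance="Close one, try leading the target.")
    if arousal > 0.25:       # user noticeably stressed
        response.update(tone="calming",
                        utterance="Take a breath; you're doing fine.",
                        gesture="slow_nod")
    elif arousal < -0.10:    # user disengaged
        response.update(tone="energetic",
                        utterance="Nice pace, ready for the next wave?")
    return response

print(adapt_agent(heart_rate_bpm=96, baseline_bpm=72, physical_event="missed_shot"))
```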
{"title":"Enhancing Perceived Empathy in Empathic Mixed Reality Agents via Context-Aware Adaptation.","authors":"Zhuang Chang, Dominik O W Hirschberg, Kunal Gupta, Mehak Sharma, Kangsoo Kim, Huidong Bai, Li Shao, Mark Billinghurst","doi":"10.1109/TVCG.2025.3646601","DOIUrl":"10.1109/TVCG.2025.3646601","url":null,"abstract":"<p><p>Mixed Reality Agents (MiRAs) have been extensively studied to enhance virtual-physical interactions, using their ability to exist in both virtual and physical environments. However, little research has focused on enhancing perceived empathy in MiRAs, despite its potential for agent-assisted therapy, education, and training. To fill this gap, we investigate the impact of an Empathic Mixed Reality agent (EMiRA) that adapts to users' physiological states and physical events in a shooting game. We found that this adaptation enhanced users' social perceptions of the agent, including social presence, social connectedness, and perceived empathy. Physiological adaptation increased paternalism and reduced user dominance, while physical adaptation had no such effect. We discuss these findings and provide design implications for future EMiRAs.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"1569-1581"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145812496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3642050
Xiang Li, Wei He, Per Ola Kristensson
As virtual reality (VR) continues to evolve, traditional input methods such as handheld controllers and gesture systems often face challenges with precision, social accessibility, and user fatigue. These limitations motivate the exploration of microgestures, which promise more subtle, ergonomic, and device-free interactions. We introduce microGEXT, a lightweight microgesture-based system designed for text editing in VR without external sensors, which utilizes small, subtle hand movements to reduce physical strain compared to standard gestures. We evaluated microGEXT in three user studies. In Study 1 ($N=20$), microGEXT reduced overall edit time and fatigue compared to a ray-casting + pinch menu baseline, the default text editing approach in commercial VR systems. Study 2 ($N=20$) found that microGEXT performed well in short text selection tasks but was slower for longer text ranges. In Study 3 ($N=10$), participants found microGEXT intuitive for open-ended information-gathering tasks. Across all studies, microGEXT demonstrated enhanced user experience and reduced physical effort, offering a promising alternative to traditional VR text editing techniques.
{"title":"Evaluating the Usability of Microgestures for Text Editing Tasks in Virtual Reality.","authors":"Xiang Li, Wei He, Per Ola Kristensson","doi":"10.1109/TVCG.2025.3642050","DOIUrl":"10.1109/TVCG.2025.3642050","url":null,"abstract":"<p><p>As virtual reality (VR) continues to evolve, traditional input methods such as handheld controllers and gesture systems often face challenges with precision, social accessibility, and user fatigue. These limitations motivate the exploration of microgestures, which promise more subtle, ergonomic, and device-free interactions. We introduce microGEXT, a lightweight microgesture-based system designed for text editing in VR without external sensors, which utilizes small, subtle hand movements to reduce physical strain compared to standard gestures. We evaluated microGEXT in three user studies. In Study 1 ($N=20$N=20), microGEXT reduced overall edit time and fatigue compared to a ray-casting + pinch menu baseline, the default text editing approach in commercial VR systems. Study 2 ($N=20$N=20) found that microGEXT performed well in short text selection tasks but was slower for longer text ranges. In Study 3 ($N=10$N=10), participants found microGEXT intuitive for open-ended information-gathering tasks. Across all studies, microGEXT demonstrated enhanced user experience and reduced physical effort, offering a promising alternative to traditional VR text editing techniques.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"2020-2033"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145746231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3629111
Juncheng Long, Honglei Su, Qi Liu, Hui Yuan, Wei Gao, Jiarun Song, Zhou Wang
No-reference (NR) bitstream-layer point cloud quality assessment (PCQA) can be deployed at any network node without full decoding, enabling real-time quality monitoring. In this work, we develop the first PCQA model dedicated to Trisoup-Lifting encoded 3D point clouds that analyzes bitstreams without full decoding. Specifically, we investigate the relationship among texture bitrate per point (TBPP), texture complexity (TC), and texture quantization parameter (TQP) when geometry encoding is lossless. We then estimate TC from TQP and TBPP and establish a texture distortion evaluation model based on TC, TBPP, and TQP. Finally, by integrating this texture distortion model with a geometry attenuation factor, a function of trisoupNodeSizeLog2 (tNSL), we obtain a comprehensive NR bitstream-layer PCQA model named streamPCQ-TL. In addition, this work establishes WPC6.0, the first PCQA database dedicated to the Trisoup-Lifting encoding mode, encompassing 400 distorted point clouds spanning 4 geometry distortion levels crossed with 5 texture distortion levels. Experimental results on the M-PCCD, ICIP2020, and proposed WPC6.0 databases suggest that streamPCQ-TL delivers robust and notable performance compared with existing advanced PCQA metrics, particularly in terms of computational cost.
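The abstract specifies which bitstream parameters feed the model (TQP, TBPP, tNSL) but not the fitted functional forms, so the sketch below uses placeholder monotone relations purely to show how the pieces would compose into a bitstream-layer quality score; none of the formulas are from the paper.

```python
# Schematic sketch only: the functional forms below are generic placeholders,
# not the fitted relations from the streamPCQ-TL paper.
import math

def estimate_tc(tqp: float, tbpp: float) -> float:
    """Texture complexity (TC) inferred from the quantization parameter (TQP)
    and texture bitrate per point (TBPP); approximated by a monotone form."""
    return tbpp * math.exp(0.1 * tqp)                  # placeholder relation

def texture_distortion(tc: float, tbpp: float, tqp: float) -> float:
    """Placeholder texture-distortion term: grows with TQP and TC, shrinks
    with the spent bitrate."""
    return tqp * tc / (1.0 + tbpp)

def geometry_attenuation(tnsl: int) -> float:
    """Attenuation factor as a function of trisoupNodeSizeLog2 (tNSL):
    larger node sizes mean coarser geometry, hence lower attainable quality."""
    return 1.0 / (1.0 + 0.5 * max(tnsl - 2, 0))

def stream_pcq_tl_score(tqp: float, tbpp: float, tnsl: int) -> float:
    """Combine the texture-distortion model with the geometry attenuation
    factor into a single bitstream-layer quality score (higher = better)."""
    tc = estimate_tc(tqp, tbpp)
    return geometry_attenuation(tnsl) / (1.0 + texture_distortion(tc, tbpp, tqp))

print(stream_pcq_tl_score(tqp=34, tbpp=1.2, tnsl=3))
```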
{"title":"Perceptual Quality Assessment of Trisoup-Lifting Encoded 3D Point Clouds.","authors":"Juncheng Long, Honglei Su, Qi Liu, Hui Yuan, Wei Gao, Jiarun Song, Zhou Wang","doi":"10.1109/TVCG.2025.3629111","DOIUrl":"10.1109/TVCG.2025.3629111","url":null,"abstract":"<p><p>No-reference bitstream-layer point cloud quality assessment (PCQA) can be deployed without full decoding at any network node to achieve real-time quality monitoring. In this work, we develop the first PCQA model dedicated to Trisoup-Lifting encoded 3D point clouds by analyzing bitstreams without full decoding. Specifically, we investigate the relationship among texture bitrate per point (TBPP), texture complexity (TC) and texture quantization parameter (TQP) while geometry encoding is lossless. Subsequently, we estimate TC by utilizing TQP and TBPP. Then, we establish a texture distortion evaluation model based on TC, TBPP and TQP. Ultimately, by integrating this texture distortion model with a geometry attenuation factor, a function of trisoupNodeSizeLog2 (tNSL), we acquire a comprehensive NR bitstream-layer PCQA model named streamPCQ-TL. In addition, this work establishes a database named WPC6.0, the first PCQA database dedicated to Trisoup-Lifting encoding mode, encompassing 400 distorted point clouds with 4 geometry multiplied by 5 texture distortion levels. Experiment results on M-PCCD, ICIP2020 and the proposed WPC6.0 database suggest that the proposed streamPCQ-TL model exhibits robust and notable performance in contrast to existing advanced PCQA metrics, particularly in terms of computational cost.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"2034-2048"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145454269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3635138
Peng Du, Xingce Wang, Zhongke Wu, Xudong Ru, Xavier Granier, Ying He
Point cloud denoising is a fundamental yet challenging task in computer graphics. Existing solutions typically rely on supervised training on synthesized noise. However, real-world noise often exhibits greater complexity, causing learning-based methods trained on synthetic noise to struggle when encountering unseen noise, a phenomenon we refer to as noise misalignment. To address this challenge, we propose LaPDA (Latent-space Point cloud Denoising with Adaptivity), a neural network explicitly designed to mitigate noise misalignment and enhance denoising robustness. LaPDA consists of two key stages. First, we adaptively model noise in the latent space, aligning unseen noise distributions with the known training distributions or adjusting them toward distributions with lower noise scales. Training objectives at this stage are formulated based on controlled synthetic noise with varying intensity levels. Second, we introduce a gradual noise removal module that optimizes the spatial distribution of the adaptively adjusted noisy points. Extensive experiments conducted on both synthetic and scanned datasets demonstrate that LaPDA achieves enhanced accuracy and robustness compared to state-of-the-art methods. We will publicly release the source code and test models.
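As a conceptual sketch of the two stages described above (latent-space noise adaptation followed by gradual noise removal), the PyTorch snippet below uses toy MLP modules; the layer shapes, residual adjustment, and step size are illustrative assumptions, not the LaPDA architecture.

```python
# Conceptual sketch only: module shapes and the residual/step design are
# illustrative, not the paper's architecture. Points are (B, N, 3) tensors.
import torch
import torch.nn as nn

class LatentNoiseAdapter(nn.Module):
    """Stage 1: adjust per-point latent codes so that unseen noise is pulled
    toward the training noise distribution (or toward a lower noise scale)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.adapt = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, noisy_points: torch.Tensor) -> torch.Tensor:
        z = self.encode(noisy_points)   # (B, N, dim) latent codes
        return z + self.adapt(z)        # residual adjustment in latent space

class GradualDenoiser(nn.Module):
    """Stage 2: predict small per-point offsets and apply them over several
    steps, gradually refining the spatial distribution of the points."""
    def __init__(self, dim: int = 64, steps: int = 3):
        super().__init__()
        self.offset = nn.Sequential(nn.Linear(dim + 3, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.steps = steps

    def forward(self, noisy_points: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        pts = noisy_points
        for _ in range(self.steps):
            pts = pts + 0.5 * self.offset(torch.cat([z, pts], dim=-1))
        return pts

if __name__ == "__main__":
    noisy = torch.rand(2, 1024, 3) + 0.02 * torch.randn(2, 1024, 3)
    adapter, denoiser = LatentNoiseAdapter(), GradualDenoiser()
    clean_pred = denoiser(noisy, adapter(noisy))
    print(clean_pred.shape)  # torch.Size([2, 1024, 3])
```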
{"title":"LaPDA: Latent-Space Point Cloud Denoising With Adaptivity.","authors":"Peng Du, Xingce Wang, Zhongke Wu, Xudong Ru, Xavier Granier, Ying He","doi":"10.1109/TVCG.2025.3635138","DOIUrl":"10.1109/TVCG.2025.3635138","url":null,"abstract":"<p><p>Point cloud denoising is a fundamental yet challenging task in computer graphics. Existing solutions typically rely on supervised training on synthesized noise. However, real-world noise often exhibits greater complexity, causing learning-based methods trained on synthetic noise to struggle when encountering unseen noise-a phenomenon we refer to as noise misalignment. To address this challenge, we propose LaPDA (Latent-space Point cloud Denoising with Adaptivity), a neural network explicitly designed to mitigate noise misalignment and enhance denoising robustness. LaPDA consists of two key stages. First, we adaptively model noise in the latent space, aligning unseen noise distributions with the known training distributions or adjusting them toward distributions with lower noise scales. Training objectives at this stage are formulated based on controlled synthetic noise with varying intensity levels. Second, we introduce a gradual noise removal module that optimizes the spatial distribution of the adaptively adjusted noisy points. Extensive experiments conducted on both synthetic and scanned datasets demonstrate that LaPDA achieves enhanced accuracy and robustness compared to state-of-the-art methods.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"1525-1539"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145574754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3640239
Daiki Hagimori, Naoya Isoyama, Monica Perusquia-Hernandez, Shunsuke Yoshimoto, Hideaki Uchiyama, Nobuchika Sakata, Kiyoshi Kiyokawa
Users are often seated in the real environment, while their virtual avatars either stand stationary or move in virtual reality (VR). This creates posture inconsistencies between the real and virtual embodiment representations, yet the relationship between posture consistency in locomotion techniques and the sense of presence in VR is still unclear. This study investigates how visual and somatosensory integration affects the sense of standing (SoSt) and the sense of self-motion (SoSm) when the sitting posture is varied slightly, highlighting the importance of sitting posture for locomotion design in VR. The degree and occurrence of SoSt and SoSm were assessed in subjective experiments, which found that higher and lower sitting postures elicit higher SoSt and lower SoSm, respectively. Invocation of SoSt also influenced postural perception, and perceived travel distance varied with the posture condition even when identical visual flow was presented. The findings suggest that posture-related visual and somatosensory integration enhances SoSt and SoSm, and a sitting posture with a higher seating position is recommended for seated VR locomotion design.
{"title":"Visual and Somatosensory Integration With Higher Sitting Posture Enhances the Sense of Standing and Self-Motion in Seated VR.","authors":"Daiki Hagimori, Naoya Isoyama, Monica Perusquia-Hernandez, Shunsuke Yoshimoto, Hideaki Uchiyama, Nobuchika Sakata, Kiyoshi Kiyokawa","doi":"10.1109/TVCG.2025.3640239","DOIUrl":"10.1109/TVCG.2025.3640239","url":null,"abstract":"<p><p>Users are often seated in the real environment, while their virtual avatars either remain standing stationary or move in virtual reality (VR). This creates posture inconsistencies between the real and virtual embodiment representations. The relationship between posture consistency in locomotion techniques and sense of presence in VR is still unclear. This study investigates how visual and somatosensory integration affects the sense of standing (SoSt) and the sense of self-motion (SoSm) when the sitting posture is varied slightly, including highlighting the importance of sitting posture for locomotion design in VR. The degree and occurrence of SoSt and SoSm were assessed by subjective experiments, and it was found that higher sitting and lower sitting postures present higher SoSt and lower SoSm, respectively. Invocation of SoSt also influences postural perception. Perception of travel distance varied according to the posture condition when identical visual flow was presented. The findings suggest that visual and somatosensory integration related to posture enhances SoSt and SoSm, and a sitting posture with a higher seating position is recommended in seated VR locomotion design.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"1767-1779"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145679809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3620400
Yatian Wang, Haoran Mo, Chengying Gao
To address the issue of style expression in existing text-driven human motion synthesis methods, we propose DiFusion, a framework for diversely stylized motion generation. It offers flexible control of content through text and of style via multiple modalities, i.e., textual labels or motion sequences. Our approach employs a dual-condition motion latent diffusion model, enabling independent control of content and style through flexible input modalities. To tackle the imbalanced complexity between the text-motion and style-motion datasets, we propose the Digest-and-Fusion training scheme, which digests domain-specific knowledge from both datasets and then adaptively fuses it in a compatible manner. Comprehensive evaluations demonstrate the effectiveness of our method and its superiority over existing approaches in terms of content alignment, style expressiveness, realism, and diversity. Additionally, our approach extends to practical applications such as motion style interpolation.
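A minimal sketch of the dual-condition idea, assuming a simple MLP denoiser: the motion latent is denoised under a content (text) embedding plus a style embedding obtained from either a label or a motion clip. All dimensions and modules are invented for illustration and do not reflect DiFusion's actual architecture.

```python
# Illustrative sketch only, not the paper's model: a denoiser conditioned
# independently on a content embedding and a style embedding whose source may
# be either a style label or a style motion clip.
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    def __init__(self, latent_dim: int = 128, cond_dim: int = 64):
        super().__init__()
        self.style_from_label = nn.Embedding(10, cond_dim)          # e.g. 10 style labels
        self.style_from_motion = nn.Sequential(nn.Linear(63, cond_dim), nn.ReLU(),
                                               nn.Linear(cond_dim, cond_dim))
        self.net = nn.Sequential(nn.Linear(latent_dim + 2 * cond_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, z_t, t, content_emb, style_label=None, style_motion=None):
        # Style may be supplied through either modality; exactly one is expected.
        if style_label is not None:
            style = self.style_from_label(style_label)
        else:
            style = self.style_from_motion(style_motion.mean(dim=1))  # pool over frames
        cond = torch.cat([content_emb, style, t[:, None]], dim=-1)
        return self.net(torch.cat([z_t, cond], dim=-1))               # predicted noise

model = DualConditionDenoiser()
z_t = torch.randn(4, 128)            # noisy motion latents
t = torch.rand(4)                    # diffusion timesteps in [0, 1]
content = torch.randn(4, 64)         # text ("content") embedding, assumed given
eps_label = model(z_t, t, content, style_label=torch.tensor([2, 2, 5, 7]))
eps_motion = model(z_t, t, content, style_motion=torch.randn(4, 30, 63))
print(eps_label.shape, eps_motion.shape)
```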
{"title":"DiFusion: Flexible Stylized Motion Generation Using Digest-and-Fusion Scheme.","authors":"Yatian Wang, Haoran Mo, Chengying Gao","doi":"10.1109/TVCG.2025.3620400","DOIUrl":"10.1109/TVCG.2025.3620400","url":null,"abstract":"<p><p>To address the issue of style expression in existing text-driven human motion synthesis methods, we propose DiFusion, a framework for diversely stylized motion generation. It offers flexible control of content through texts and style via multiple modalities, i.e., textual labels or motion sequences. Our approach employs a dual-condition motion latent diffusion model, enabling independent control of content and style through flexible input modalities. To tackle the issue of imbalanced complexity between the text-motion and style-motion datasets, we propose the Digest-and-Fusion training scheme, which digests domain-specific knowledge from both datasets and then adaptively fuses them into a compatible manner. Comprehensive evaluations demonstrate the effectiveness of our method and its superiority over existing approaches in terms of content alignment, style expressiveness, realism, and diversity. Additionally, our approach can be extended to practical applications, such as motion style interpolation.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"1593-1604"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145287981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3617147
Junlong Chen, Rosella P Galindo Esparza, Vanja Garaj, Per Ola Kristensson, John Dudley
Effective visual accessibility in Virtual Reality (VR) is crucial for Blind and Low Vision (BLV) users. However, designing visual accessibility systems is challenging due to the complexity of 3D VR environments and the need for techniques that can be easily retrofitted into existing applications. While prior work has studied how to enhance or translate visual information, the advancement of Vision Language Models (VLMs) provides an exciting opportunity to advance the scene interpretation capability of current systems. This paper presents EnVisionVR, an accessibility tool for VR scene interpretation. A formative study of usability barriers confirmed the lack of visual accessibility features as a key barrier for BLV users of VR content and applications. These findings informed the design and development of EnVisionVR, a novel visual accessibility system leveraging a VLM, voice input, and multimodal feedback for scene interpretation and virtual object interaction in VR. An evaluation with 12 BLV users demonstrated that EnVisionVR significantly improved their ability to locate virtual objects, effectively supporting scene understanding and object interaction.
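The following sketch illustrates the general pattern of VLM-based scene interpretation the abstract describes: a voice query and the current viewport go to a VLM, and the answer is grounded with scene-object metadata before being spoken back. The VLM and speech backends are abstracted as injected callables; none of this reflects EnVisionVR's actual code or any specific vendor API.

```python
# Sketch only: EnVisionVR's implementation is not described at this level in the
# abstract. The VLM call, speech output, and scene-object format are hypothetical
# placeholders injected as callables so that no real vendor API is implied.
from typing import Callable, Dict, List

def interpret_scene(voice_query: str,
                    viewport_image: bytes,
                    scene_objects: List[Dict],      # e.g. [{"name": "mug", "position": (x, y, z)}]
                    ask_vlm: Callable[[str, bytes], str],
                    speak: Callable[[str], None]) -> None:
    """Answer a spoken query about the current VR view: the VLM produces a textual
    interpretation, and if a known virtual object is mentioned, its position is
    appended so the system can also give a spatial cue."""
    prompt = f"User asks: '{voice_query}'. Describe the relevant parts of this VR view concisely."
    description = ask_vlm(prompt, viewport_image)

    for obj in scene_objects:                       # ground the answer in scene metadata
        if obj["name"].lower() in description.lower():
            description += f" The {obj['name']} is at {obj['position']} relative to you."
    speak(description)

# Dummy wiring so the sketch runs without any real VLM or speech backend.
interpret_scene(
    "Where is the mug?", b"<viewport pixels>",
    [{"name": "mug", "position": (0.4, 1.0, -0.8)}],
    ask_vlm=lambda prompt, img: "There is a mug on the table to your right.",
    speak=print,
)
```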
{"title":"EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality.","authors":"Junlong Chen, Rosella P Galindo Esparza, Vanja Garaj, Per Ola Kristensson, John Dudley","doi":"10.1109/TVCG.2025.3617147","DOIUrl":"10.1109/TVCG.2025.3617147","url":null,"abstract":"<p><p>Effective visual accessibility in Virtual Reality (VR) is crucial for Blind and Low Vision (BLV) users. However, designing visual accessibility systems is challenging due to the complexity of 3D VR environments and the need for techniques that can be easily retrofitted into existing applications. While prior work has studied how to enhance or translate visual information, the advancement of Vision Language Models (VLMs) provides an exciting opportunity to advance the scene interpretation capability of current systems. This paper presents EnVisionVR, an accessibility tool for VR scene interpretation. Through a formative study of usability barriers, we confirmed the lack of visual accessibility features as a key barrier for BLV users of VR content and applications. In response, we used our findings from the formative study to inform the design and development of EnVisionVR, a novel visual accessibility system leveraging a VLM, voice input and multimodal feedback for scene interpretation and virtual object interaction in VR. An evaluation with 12 BLV users demonstrated that EnVisionVR significantly improved their ability to locate virtual objects, effectively supporting scene understanding and object interaction.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"2007-2019"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145240653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3621585
Yilei Chen, Ping An, Xinpeng Huang, Qiang Wu
Generalizable NeRF synthesizes novel views of unseen scenes without per-scene training. The view-epipolar transformer has become popular in this field for its ability to produce high-quality views. Existing methods with this architecture rely on the assumption that texture consistency across views can identify object surfaces, and such identification is crucial for determining where to reconstruct texture. However, this assumption does not always hold, as different surface positions may share similar texture features, creating ambiguity in surface identification. To handle this ambiguity, this paper introduces 3D volume features into the view-epipolar transformer. These features carry geometric information that supplements the texture features. By incorporating both texture and geometric cues in the consistency measurement, our method mitigates the ambiguity in surface detection, leading to more accurate surfaces and thus better novel view synthesis. Additionally, we propose a decoupled decoder in which volume and texture features are used for density and color prediction, respectively, so that the two properties can be predicted without mutual interference. Experiments show improved results over existing transformer-based methods on both real-world and synthetic datasets.
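As a minimal sketch of the decoupled-decoder idea (volume features drive density, epipolar texture features drive color), the snippet below uses toy MLP heads; the feature dimensions and layers are assumptions, not the paper's architecture.

```python
# Illustrative sketch only: a minimal decoupled decoder in the spirit of the
# abstract. Per-sample volume features carry geometric cues (used for density);
# aggregated epipolar texture features are used for color.
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    def __init__(self, vol_dim: int = 32, tex_dim: int = 64):
        super().__init__()
        self.density_head = nn.Sequential(nn.Linear(vol_dim, 64), nn.ReLU(),
                                          nn.Linear(64, 1), nn.Softplus())
        self.color_head = nn.Sequential(nn.Linear(tex_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 3), nn.Sigmoid())

    def forward(self, volume_feat: torch.Tensor, texture_feat: torch.Tensor):
        sigma = self.density_head(volume_feat)   # geometry cues -> density
        rgb = self.color_head(texture_feat)      # texture cues  -> color
        return sigma, rgb

# One ray with 64 samples; both feature sets are assumed precomputed here
# (from the 3D feature volume and the view-epipolar transformer, respectively).
sigma, rgb = DecoupledDecoder()(torch.randn(64, 32), torch.randn(64, 64))
print(sigma.shape, rgb.shape)   # torch.Size([64, 1]) torch.Size([64, 3])
```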
{"title":"Volume Feature Aware View-Epipolar Transformers for Generalizable NeRF.","authors":"Yilei Chen, Ping An, Xinpeng Huang, Qiang Wu","doi":"10.1109/TVCG.2025.3621585","DOIUrl":"10.1109/TVCG.2025.3621585","url":null,"abstract":"<p><p>Generalizable NeRF synthesizes novel views of unseen scenes without per-scene training. The view-epipolar transformer has become popular in this field for its ability to produce high-quality views. Existing methods with this architecture rely on the assumption that texture consistency across views can identify object surfaces, with such identification crucial for determining where to reconstruct texture. However, this assumption is not always valid, as different surface positions may share similar texture features, creating ambiguity in surface identification. To handle this ambiguity, this paper introduces 3D volume features into the view-epipolar transformer. These features contain geometric information, which will be a supplement to texture features. By incorporating both texture and geometric cues in consistency measurement, our method mitigates the ambiguity in surface detection. This leads to more accurate surfaces and thus better novel view synthesis. Additionally, we propose a decoupled decoder where volume and texture features are used for density and color prediction respectively. In this way, the two properties can be better predicted without mutual interference. Experiments show improved results over existing transformer-based methods on both real-world and synthetic datasets.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"2049-2060"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145294710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-01 DOI: 10.1109/TVCG.2025.3642300
Shaoxu Li, Chuhang Ma, Ye Pan
Zero-shot text-to-video diffusion models extend pre-trained image diffusion models to the video domain without additional training. Prevailing techniques commonly rely on existing shapes as constraints and introduce inter-frame attention to ensure texture consistency. However, such shape constraints tend to restrict stylized geometric deformation of the video and inadvertently neglect the original texture characteristics. Furthermore, existing methods suffer from flickering and inconsistent facial expressions. In this paper, we present DiffPortraitVideo, a framework that employs a diffusion-model-based feature and attention injection mechanism to generate key frames, with cross-frame constraints to enforce coherence and adaptive feature fusion to ensure expression consistency. Our approach achieves high spatio-temporal and expression consistency while retaining the textual and original image properties. Extensive experiments validate the efficacy of the proposed framework in generating personalized, high-quality, and coherent videos, showcasing its superiority over existing approaches and paving the way for further research on text-to-video generation with enhanced personalization and quality.
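Cross-frame constraints of this kind are commonly realized by letting every frame attend to a key (anchor) frame; the sketch below shows that generic mechanism using PyTorch's scaled_dot_product_attention and is not the paper's implementation.

```python
# Generic cross-frame attention sketch, not DiffPortraitVideo's code: each frame's
# queries attend to the keys/values of an anchor key frame, the usual way texture
# and expression are kept consistent across frames.
import torch
import torch.nn.functional as F

def cross_frame_attention(q_frames: torch.Tensor, kv_anchor: torch.Tensor,
                          num_heads: int = 4) -> torch.Tensor:
    """q_frames: (F, N, C) per-frame query tokens; kv_anchor: (N, C) tokens of the
    anchor key frame, shared as keys and values by every frame."""
    f, n, c = q_frames.shape
    k = kv_anchor.expand(f, -1, -1)                   # broadcast anchor tokens to all frames
    split = lambda x: x.reshape(f, n, num_heads, c // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(split(q_frames), split(k), split(k))
    return out.transpose(1, 2).reshape(f, n, c)

frames = torch.randn(8, 256, 64)      # 8 frames, 256 latent tokens each
anchor = frames[0]                    # first frame acts as the key frame
print(cross_frame_attention(frames, anchor).shape)   # torch.Size([8, 256, 64])
```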
{"title":"DiffPortraitVideo: Diffusion-Based Expression-Consistent Zero-Shot Portrait Video Translation.","authors":"Shaoxu Li, Chuhang Ma, Ye Pan","doi":"10.1109/TVCG.2025.3642300","DOIUrl":"10.1109/TVCG.2025.3642300","url":null,"abstract":"<p><p>Zero-shot text-to-video diffusion models are crafted to expand pre-trained image diffusion models to the video domain without additional training. In recent times, prevailing techniques commonly rely on existing shapes as constraints and introduce inter-frame attention to ensure texture consistency. However, such shape constraints tend to restrict the stylized geometric deformation of videos and inadvertently neglect the original texture characteristics. Furthermore, existing methods suffer from flickering and inconsistent facial expressions. In this paper, we present DiffPortraitVideo. The framework employs a diffusion model-based feature and attention injection mechanism to generate key frames, with cross-frame constraints to enforce coherence and adaptive feature fusion to ensure expression consistency. Our approach achieves high spatio-temporal and expression consistency while retaining the textual and original image properties. Extensive and comprehensive experiments are conducted to validate the efficacy of our proposed framework in generating personalized, high-quality, and coherent videos. This not only showcases the superiority of our method over existing approaches but also paves the way for further research and development in the field of text-to-video generation with enhanced personalization and quality.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":"1656-1667"},"PeriodicalIF":6.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145727816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}