
Latest publications in Displays

RAP-SORT: Advanced Multi-Object Tracking for complex scenarios
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-21 | DOI: 10.1016/j.displa.2026.103361
Shuming Zhang, Yuhang Zhu, Yanhui Sun, Weiyong Liu, Zhangjin Huang
Multi-Object Tracking (MOT) aims to detect and associate objects across frames while maintaining consistent IDs. While some approaches leverage both strong and weak cues alongside camera compensation to improve association, they struggle in scenarios involving high object density or nonlinear motion. To address these challenges, we propose RAP-SORT, a novel MOT framework that introduces four key innovations. First, the Robust Tracklet Confidence Modeling (RTCM) module models trajectory confidence by smoothing updates and applying second-order difference adjustments for low-confidence cases. Second, the Advanced Observation-Centric Recovery (AOCR) module facilitates trajectory recovery via linear interpolation and backtracking. Third, the Pseudo-Depth IoU (PDIoU) metric integrates height and depth cues into IoU calculations for enhanced spatial awareness. Finally, the Window Denoising (WD) module is tailored for the DanceTrack dataset, effectively mitigating the creation of new tracks caused by misdetections. RAP-SORT sets a new state-of-the-art on the DanceTrack and MOT20 benchmarks, achieving HOTA scores of 66.7 and 64.2, surpassing the previous best by 1.0 and 0.3, respectively, while also delivering competitive performance on MOT17. Code and models will be available soon at https://github.com/levi5611/RAP-SORT.
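The abstract does not give the formula for the Pseudo-Depth IoU (PDIoU) metric, so the Python sketch below is only a plausible reading rather than the paper's definition: it discounts the ordinary IoU by the gap between a pseudo-depth proxy (taken from each box's bottom edge) and by the relative height difference. The depth proxy, the multiplicative form, and the `alpha` weight are all illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Standard IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def pseudo_depth_iou(box_a, box_b, img_h, alpha=0.5):
    """Hypothetical PDIoU: discount IoU by how far apart the boxes are in
    pseudo-depth (bottom-edge position) and in height. RAP-SORT's exact
    formulation may differ."""
    # Pseudo-depth proxy: a box whose bottom edge sits lower in the image
    # is assumed to be closer to the camera.
    depth_a = 1.0 - box_a[3] / img_h
    depth_b = 1.0 - box_b[3] / img_h
    h_a = box_a[3] - box_a[1]
    h_b = box_b[3] - box_b[1]
    depth_gap = abs(depth_a - depth_b)                  # in [0, 1]
    height_gap = abs(h_a - h_b) / max(h_a, h_b, 1e-9)   # relative height difference
    return iou(box_a, box_b) * (1.0 - alpha * depth_gap) * (1.0 - (1 - alpha) * height_gap)

# Example: two overlapping detections at similar pseudo-depth keep a high score.
print(pseudo_depth_iou((10, 20, 60, 120), (15, 25, 65, 125), img_h=480))
```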
Citations: 0
Ctp2Fic: From coarse-grained token pruning to fine-grained token clustering for LVLM inference acceleration
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-21 | DOI: 10.1016/j.displa.2026.103360
Yulong Lei, Zishuo Wang, Jinglin Xu, Yuxin Peng
Large Vision–Language Models (LVLMs) excel in multimodal tasks, but their high computational cost, driven by the large number of image tokens, severely limits inference efficiency. While existing training-free methods reduce token counts to accelerate inference, they often struggle to preserve model performance. This trade-off between efficiency and accuracy poses the key challenge in accelerating Large Vision–Language Model (LVLM) inference without retraining. In this paper, we analyze the rank of attention matrices across layers and discover that image token redundancy peaks in two specific VLM layers: many tokens convey nearly identical information, yet still participate in subsequent computations. Leveraging this insight, we propose Ctp2Fic, a new two-stage coarse-to-fine token compression framework. Specifically, in the Coarse-grained Text-guided Pruning stage, we dynamically assign a weight to each visual token based on its semantic relevance to the input instruction and prune low-weight tokens that are unrelated to the task. During the Fine-grained Image-based Clustering stage, we apply a lightweight clustering algorithm to merge semantically similar tokens into compact, representative ones, thus further reducing the sequence length. Our framework requires no model fine-tuning and seamlessly integrates into existing LVLM inference pipelines. Extensive experiments demonstrate that Ctp2Fic outperforms state-of-the-art acceleration techniques in both inference speed and accuracy, achieving superior efficiency and performance without retraining.
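As a rough illustration of the two-stage idea (coarse text-guided pruning followed by fine clustering), the sketch below weights visual tokens by cosine similarity to a text embedding, keeps the top fraction, and then greedily merges near-duplicate tokens. The `keep_ratio`, `merge_thresh`, and the greedy clustering rule are stand-ins, not the paper's actual procedure.

```python
import numpy as np

def prune_then_cluster(vis_tokens, text_emb, keep_ratio=0.5, merge_thresh=0.9):
    """Two-stage token compression in the spirit of Ctp2Fic.
    vis_tokens: (N, D) visual token embeddings; text_emb: (D,) instruction embedding.
    Both the relevance weighting and the clustering rule are illustrative."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

    v = normalize(vis_tokens)
    t = normalize(text_emb[None, :])

    # Stage 1: coarse text-guided pruning -- keep tokens most relevant to the instruction.
    weights = (v @ t.T).squeeze(-1)                        # cosine similarity per token
    keep = np.argsort(-weights)[: max(1, int(len(v) * keep_ratio))]
    kept = vis_tokens[keep]

    # Stage 2: fine clustering -- greedily merge near-duplicate tokens.
    clusters = []
    for tok in kept:
        placed = False
        for c in clusters:
            centroid = np.mean(c, axis=0)
            sim = float(normalize(tok[None])[0] @ normalize(centroid[None])[0])
            if sim > merge_thresh:
                c.append(tok)
                placed = True
                break
        if not placed:
            clusters.append([tok])
    return np.stack([np.mean(c, axis=0) for c in clusters])  # one representative per cluster

# Toy usage: 576 image tokens compressed against a text embedding.
rng = np.random.default_rng(0)
out = prune_then_cluster(rng.normal(size=(576, 64)), rng.normal(size=64))
print(out.shape)
```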
Citations: 0
MeAP: dual level memory strategy augmented transformer based visual object predictor
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-20 | DOI: 10.1016/j.displa.2026.103356
Shiliang Yan, Yinling Wang, Dandan Lu, Min Wang
Handling persistent noise incursions within tracking sequences, especially occlusion, illumination variations, and fast motion, has garnered substantial attention for its role in enhancing the accuracy and robustness of visual object trackers. However, existing trackers equipped with template-updating mechanisms or calibration strategies rely heavily on time-consuming historical data to achieve optimal tracking performance, impeding their real-time capabilities. To address these challenges, this paper introduces a long–short-term, dual-level memory augmented, transformer-structure-aided visual object predictor (MeAP). The key contributions of MeAP are: 1) a noise model for specific invasion events, built from incursion effects and the corresponding template strategies, which serves as the foundation for more efficient memory utilization; 2) a memory exploration scheme, based on an online tracking mask-based feature extraction strategy and the transformer architecture, introduced to mitigate the impact of noise invasion during memory vector construction; and 3) a memory utilization scheme, based on the target's basic feature and a dual-feature target mask predictor, which exploits scene-edge features in the mask-based feature extraction method and jointly predicts the accurate location of the tracking target. Extensive experiments on the OTB100, NFS, VOT2021, and AVisT benchmarks demonstrate that MeAP achieves tracking performance comparable to other state-of-the-art (SOTA) trackers and operates at an average speed of 31 frames per second (FPS) across the four benchmarks.
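The following minimal sketch illustrates one way a dual-level (long/short-term) memory could be maintained: an exponential moving average for the long-term slot, a small FIFO for the short-term slot, and a confidence gate that keeps likely noise-corrupted frames out of long-term memory. The class name, gating rule, and fusion weights are hypothetical; MeAP's actual memory exploration and utilization schemes are more elaborate.

```python
from collections import deque
import numpy as np

class DualLevelMemory:
    """Illustrative long/short-term memory bank for a tracker's target features.
    This is a sketch of a dual-level memory strategy, not the exact MeAP design."""

    def __init__(self, dim, short_len=5, momentum=0.05, conf_gate=0.6):
        self.long_term = np.zeros(dim)
        self.short_term = deque(maxlen=short_len)
        self.momentum = momentum
        self.conf_gate = conf_gate
        self.initialized = False

    def update(self, feat, confidence):
        self.short_term.append(feat)
        if confidence >= self.conf_gate:        # gate against occlusion / noise frames
            if not self.initialized:
                self.long_term = feat.copy()
                self.initialized = True
            else:
                self.long_term = (1 - self.momentum) * self.long_term + self.momentum * feat

    def read(self):
        """Fuse the two levels into one memory vector for the predictor head."""
        short = np.mean(self.short_term, axis=0) if self.short_term else self.long_term
        return 0.5 * self.long_term + 0.5 * short

mem = DualLevelMemory(dim=256)
mem.update(np.random.randn(256), confidence=0.9)
print(mem.read().shape)
```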
Citations: 0
Interactive feature pyramid network for small object detection in UAV aerial images
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-17 | DOI: 10.1016/j.displa.2026.103352
Jinhang Zhang, LiQiang Song, Min Gao, Wenzhao Li, Zhuang Wei
The high prevalence of small objects in aerial images presents a significant challenge for object detection tasks. In this paper, we propose the Interactive Feature Pyramid Network (IFPN) specifically for small object detection in aerial images. The IFPN architecture comprises an Interactive Channel-Wise Attention (ICA) module and an Interactive Spatial-Wise Attention (ISA) module. The ICA and ISA modules facilitate feature interaction across multiple layers, thereby mitigating semantic gaps and information loss inherent in traditional feature pyramids, and effectively capturing the detailed features essential for small objects. By incorporating global contextual information, IFPN enhances the model’s ability to discern the relationship between the target and its surrounding context, particularly in scenarios where small objects exhibit limited features, thereby significantly improving the accuracy of small object detection. Additionally, we propose an Attention Convolution Module (ACM) designed to furnish high-quality feature bases for IFPN during its early stages. Extensive experiments conducted on aerial image datasets attest to the effectiveness and sophistication of IFPN for detecting small objects within aerial images.
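To make the cross-level interaction idea concrete, here is a hedged sketch of a channel-wise attention block in which every pyramid level contributes a pooled descriptor and receives per-channel gates computed from the joint descriptor. The module name, layer widths, and interaction rule are assumptions, not the ICA module's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveChannelAttention(nn.Module):
    """Illustrative cross-level channel attention: global descriptors from all
    pyramid levels are pooled, mixed, and used to re-weight each level's channels."""

    def __init__(self, channels, num_levels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.mix = nn.Sequential(
            nn.Linear(channels * num_levels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels * num_levels),
        )
        self.num_levels = num_levels
        self.channels = channels

    def forward(self, feats):                 # feats: list of (B, C, Hi, Wi) tensors
        descriptors = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats]  # (B, C) each
        joint = torch.cat(descriptors, dim=1)                                  # (B, C * L)
        gates = torch.sigmoid(self.mix(joint)).chunk(self.num_levels, dim=1)   # L x (B, C)
        return [f * g.view(-1, self.channels, 1, 1) for f, g in zip(feats, gates)]

# Toy usage on a 3-level pyramid.
ica = InteractiveChannelAttention(channels=64, num_levels=3)
pyramid = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
print([o.shape for o in ica(pyramid)])
```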
Citations: 0
PGgraf: Pose-Guided generative radiance field for novel-views on X-ray
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-17 | DOI: 10.1016/j.displa.2026.103354
Hangyu Li, Moquan Liu, Nan Wang, Mengcheng Sun, Yu Zhu
In clinical diagnosis, doctors usually rely on only a few X-rays to avoid harming the patient with excessive ionizing radiation. Recent Neural Radiance Field (NERF) technology contemplates generating novel views from a single X-ray to assist physicians in diagnosis. In this task, we consider two advantages of X-ray filming over natural images: (1) The medical equipment is fixed, and there is a standardized filming pose. (2) X-rays of the same body part at the same pose share an apparent structural prior. Based on such conditions, we propose a Pose-Guided generative radiance field (PGgraf) containing a generator and a discriminator. In the training phase, the discriminator combines the image features with two kinds of pose information (ray direction set and camera angle) to guide the generator to synthesize X-rays consistent with the realistic view. In the generator, we design a Density Reconstruction Block (DRB). Unlike the original NERF, which directly estimates the particle density based on the particle positions, the DRB considers all the particle features sampled in a ray and integrally predicts the density of each particle. Qualitative and quantitative experiments against state-of-the-art NERF schemes on two chest datasets and one knee dataset show that PGgraf has a clear advantage in inferring novel views at different ranges. In the three ranges of 0° to 360°, −15° to 15°, and 75° to 105°, the Peak Signal-to-Noise Ratio (PSNR) improved by an average of 4.18 dB, and the Learned Perceptual Image Patch Similarity (LPIPS) improved by an average of 50.7%.
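The abstract's key point about the Density Reconstruction Block is that densities are predicted from all samples along a ray jointly rather than point by point. The sketch below captures that idea with a 1D convolution over the sample axis; the layer widths, kernel size, and Softplus output are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class DensityReconstructionBlock(nn.Module):
    """Sketch of a ray-level density head: instead of mapping each sampled point
    to a density independently (as in vanilla NeRF), all samples along a ray are
    processed jointly so every density sees its neighbours."""

    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        # A 1D conv over the sample axis lets each density depend on nearby samples.
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
            nn.Softplus(),                       # densities must be non-negative
        )

    def forward(self, ray_feats):                # (num_rays, num_samples, feat_dim)
        x = ray_feats.permute(0, 2, 1)           # -> (num_rays, feat_dim, num_samples)
        return self.net(x).squeeze(1)            # per-sample densities (num_rays, num_samples)

drb = DensityReconstructionBlock(feat_dim=32)
print(drb(torch.randn(4, 128, 32)).shape)        # torch.Size([4, 128])
```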
Citations: 0
A new iterative inverse display model
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-16 | DOI: 10.1016/j.displa.2026.103342
María José Pérez-Peñalver, S.-W. Lee, Cristina Jordán, Esther Sanabria-Codesal, Samuel Morillas
In this paper, we propose a new inverse model for display characterization based on the direct model developed in Kim and Lee (2015). We use an iterative method to compute which inputs produce a desired color expressed in device-independent color coordinates. Whereas iterative approaches have been used in the past for this task, the main novelty of our proposal is the use of specific heuristics, based on the former display model and color science principles, to achieve efficient and accurate convergence. On the one hand, to set the initial point of the iterative process, we use orthogonal projections of the desired color chromaticity, xy, onto the display's chromaticity triangle to find the initial ratio that the RGB coordinates need to have. Subsequently, we use a factor product, preserving RGB proportions, to initially approximate the desired color's luminance. This factor is obtained through a nonlinear modeling of the relation between RGB and luminance. On the other hand, to reduce the number of iterations needed, we use the direct model mentioned above: to set the RGB values of the next iteration, we look at the difference between the color predicted by the direct model for the current RGB values and the desired color coordinates, treating chromaticity and luminance separately, following the same reasoning as for the initial point. As the experimental results show, the method is accurate, efficient, and robust. With respect to the state of the art, method performance is especially good for low-quality displays, where physical assumptions made by other models do not hold completely.
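A minimal numerical sketch of the iterative idea follows, using a plain gamma-plus-primary-matrix display model as a stand-in for the Kim and Lee (2015) direct model. The initialization inverts the linear part to fix the RGB ratios, a luminance scaling preserves chromaticity, and a damped correction driven by the direct-model error refines the result; the matrix, gamma, step size, and simplified direct model are assumptions for illustration only.

```python
import numpy as np

# Stand-in direct model: per-channel gamma followed by a 3x3 primary matrix (RGB -> XYZ).
PRIMARIES = np.array([[0.4124, 0.3576, 0.1805],
                      [0.2126, 0.7152, 0.0722],
                      [0.0193, 0.1192, 0.9505]])
GAMMA = 2.2

def direct_model(rgb):
    """Predict XYZ for a device RGB triple in [0, 1]."""
    return PRIMARIES @ np.power(np.clip(rgb, 0, 1), GAMMA)

def inverse_model(target_xyz, iters=50, step=0.5):
    """Iteratively find the RGB input whose predicted XYZ matches the target.
    The paper's separate chromaticity/luminance heuristics are condensed here
    into a simple fixed-point refinement."""
    # Initial guess: invert the linear part, then undo gamma (keeps RGB ratios).
    lin = np.clip(np.linalg.solve(PRIMARIES, target_xyz), 1e-6, None)
    rgb = np.power(lin, 1.0 / GAMMA)
    for _ in range(iters):
        pred = direct_model(rgb)
        # Luminance correction: scale all channels by the Y ratio (preserves chromaticity).
        rgb = np.clip(rgb * (target_xyz[1] / max(pred[1], 1e-9)) ** (1.0 / GAMMA), 0, 1)
        # Chromaticity correction: nudge linear RGB toward the remaining XYZ error.
        err = target_xyz - direct_model(rgb)
        lin = np.clip(np.power(rgb, GAMMA) + step * np.linalg.solve(PRIMARIES, err), 1e-6, 1)
        rgb = np.power(lin, 1.0 / GAMMA)
    return rgb

target = direct_model(np.array([0.3, 0.6, 0.2]))
print(np.round(inverse_model(target), 4))   # should recover approximately [0.3, 0.6, 0.2]
```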
Citations: 0
Efficient road marking extraction via cooperative enhancement of foundation models and Mamba
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-14 | DOI: 10.1016/j.displa.2026.103351
Kang Zheng, Fu Ren
Road marking extraction is critical for high-definition mapping and autonomous driving, yet most lightweight models overlook the long-tailed appearance of thin markings during real-time inference. We propose the Efficient Road Markings Segmentation Network (ERMSNet), a hybrid network that pairs lightweight design with the expressive power of Mamba and foundation models. ERMSNet comprises three synergistic branches. (1) A wavelet-augmented Baseline embeds a Road-Marking Mamba (RM-Mamba) whose bi-directional vertical scan captures elongated structures with fewer parameters than vanilla Mamba. (2) A Feature Enhancement branch distills dense image embeddings from the frozen Segment Anything Model (SAM) foundation model through a depth-wise squeeze-and-excitation adapter, injecting rich spatial detail at negligible cost. (3) An Attention Focusing branch projects text–image similarities produced by the Contrastive Language-Image Pre-training (CLIP) foundation model as soft masks that steer the decoder toward rare classes. Comprehensive experiments on CamVid and our newly released Wuhan Road-Marking (WHRM) benchmark verify the design. Experimental results demonstrate that ERMSNet, with a lightweight configuration of only 0.99 million parameters and 6.44 GFLOPs, achieves mIoU scores of 79.85% and 81.18% on the two benchmarks, respectively. Compared with existing state-of-the-art methods, ERMSNet significantly reduces computational and memory costs while still delivering outstanding segmentation performance. Its superiority is especially evident in extracting thin and infrequently occurring road markings, highlighting its strong ability to balance efficiency and accuracy. Code and the WHRM dataset will be released upon publication.
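As an example of how frozen foundation-model features can be adapted cheaply, the sketch below shows a depth-wise convolution followed by a squeeze-and-excitation gate and a resize, in the spirit of the Feature Enhancement branch. The channel count, reduction ratio, and placement are assumptions; SAM itself is represented only by a random stand-in tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSEAdapter(nn.Module):
    """Sketch of a depth-wise squeeze-and-excitation adapter that re-weights
    frozen foundation-model embeddings before fusion with the segmentation branch."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, sam_feat, out_size):        # sam_feat: (B, C, H, W) from a frozen encoder
        x = self.dw(sam_feat)                     # cheap depth-wise spatial mixing
        s = F.adaptive_avg_pool2d(x, 1).flatten(1)          # squeeze: (B, C)
        g = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))    # excitation gates: (B, C)
        x = x * g.view(x.size(0), -1, 1, 1)
        # Resize to the segmentation branch's resolution before injection.
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)

adapter = DepthwiseSEAdapter(channels=256)
sam_embedding = torch.randn(1, 256, 64, 64)      # stand-in for a frozen SAM image embedding
print(adapter(sam_embedding, out_size=(128, 128)).shape)
```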
Citations: 0
TDOA based localization mechanism for the UAV positioning in dark and confined environments
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-14 | DOI: 10.1016/j.displa.2026.103346
Haobin Shi, Quantao Wang, Zihan Wang, Jianning Zhan, Huijian Liang, Beiya Yang
With the growing demand for autonomous inspection with Unmanned Aerial Vehicles (UAVs) in dark and confined environments, accurately determining the UAV position has become crucial. Ultra-Wideband (UWB) localization technology offers a promising solution by overcoming the challenges posed by signal obstruction, low illumination, and confined spaces. However, conventional UWB-based positioning suffers from performance oscillations due to measurement inconsistencies and degradation under time-varying noise models. Furthermore, the widely used Two-Way Time-of-Flight (TW-TOF) method has limitations, such as high energy consumption and a limit on the number of tags that can be deployed. To address these issues, a sensor fusion approach combining UWB and Inertial Measurement Unit (IMU) measurements with a Time Difference of Arrival (TDOA) localization mechanism is proposed. This method exploits an adaptive Kalman filter, which dynamically adjusts to noise model variations and employs individual weighting factors for each anchor node, enhancing stability and robustness in challenging environments. Comprehensive experiments demonstrate that the proposed algorithm achieves a median positioning error of 0.110 m, a 90th percentile error of 0.232 m, and an average standard deviation of 0.075 m, with significantly reduced energy consumption. Additionally, owing to TDOA communication principles, the method supports multiple tag nodes, making it well suited for multi-UAV collaborative inspections in future applications.
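To ground the measurement side of such a pipeline, here is a standard extended-Kalman-filter update for TDOA range-difference measurements with per-anchor noise weights, which is where an adaptive scheme like the paper's would inject its individually tuned factors. The IMU-driven prediction step and the adaptation law itself are omitted; the anchor layout and noise values are made up for the demo.

```python
import numpy as np

def tdoa_ekf_update(x, P, anchors, tdoa_meas, sigma, ref=0):
    """One EKF measurement update for TDOA positioning.
    x: (3,) position estimate; P: (3, 3) covariance; anchors: (N, 3) anchor positions;
    tdoa_meas: (N-1,) measured range differences w.r.t. the reference anchor;
    sigma: (N-1,) per-anchor measurement std-devs (stand-in for the adaptive weights)."""
    others = [i for i in range(len(anchors)) if i != ref]
    d = np.linalg.norm(anchors - x, axis=1)                # distances to all anchors
    h = d[others] - d[ref]                                 # predicted range differences
    # Jacobian of h with respect to the position.
    H = np.array([(x - anchors[i]) / d[i] - (x - anchors[ref]) / d[ref] for i in others])
    R = np.diag(np.square(sigma))                          # per-anchor noise weighting
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                         # Kalman gain
    x_new = x + K @ (np.asarray(tdoa_meas) - h)
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new

anchors = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 5]], float)
true_pos = np.array([4.0, 3.0, 1.5])
d_true = np.linalg.norm(anchors - true_pos, axis=1)
meas = d_true[1:] - d_true[0]                              # noise-free TDOA for the demo
x, P = np.array([5.0, 5.0, 2.0]), np.eye(3)
x, P = tdoa_ekf_update(x, P, anchors, meas, sigma=np.full(3, 0.1))
print(np.round(x, 3))
```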
Citations: 0
Rethinking low-light image enhancement: A local–global synergy perspective
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-13 | DOI: 10.1016/j.displa.2026.103348
Qinghua Lin, Yu Long, Xudong Xiong, Wenchao Jiang, Zhihua Wang, Qiuping Jiang
Low-light image enhancement (LLIE) remains a challenging task due to the complex degradations in illumination, contrast, and structural details. Deep neural network-based approaches have shown promising results in addressing LLIE. However, most existing methods either utilize convolutional layers with local receptive fields, which are well-suited for restoring local textures, or Transformer layers with long-range dependencies, which are better at correcting global illumination. Despite their respective strengths, these approaches often struggle to effectively handle both aspects simultaneously. In this paper, we revisit LLIE from a local–global synergy perspective and propose a unified framework, the Local–Global Synergy Network (LGS-Net). LGS-Net explicitly extracts local and global features in parallel using a separable CNN and a Swin Transformer block, respectively, effectively modeling both local structural fidelity and global illumination balance. The extracted features are then fed into a squeeze-and-excitation-based fusion module, which adaptively integrates multi-scale information guided by perceptual relevance. Extensive experiments on multiple real-world benchmarks show that our method consistently outperforms existing state-of-the-art methods across both quantitative metrics (e.g., PSNR, SSIM, Q-Align) and perceptual quality, with notable improvements in color fidelity and detail preservation under extreme low-light and non-uniform illumination.
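The sketch below mirrors the local–global synergy at a small scale: a depth-wise separable convolution branch for local texture, plain multi-head attention over downsampled tokens standing in for the Swin Transformer block, and an SE-style gate that mixes the two. Branch sizes, the 16x16 token grid, and the gating head are illustrative assumptions rather than the LGS-Net architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalFusion(nn.Module):
    """Sketch of a local-global synergy block: local separable-conv branch,
    a stand-in global attention branch, and an SE-style fusion gate."""

    def __init__(self, channels, heads=4, reduction=8):
        super().__init__()
        self.local = nn.Sequential(                         # depth-wise separable convolution
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Sequential(                          # SE-style fusion weights
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = F.adaptive_avg_pool2d(x, 16).flatten(2).transpose(1, 2)   # (B, 256, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, 16, 16)
        glob = F.interpolate(glob, size=(h, w), mode="bilinear", align_corners=False)
        desc = torch.cat([local.mean(dim=(2, 3)), glob.mean(dim=(2, 3))], dim=1)
        wts = torch.softmax(self.gate(desc), dim=1)          # per-image branch weights
        return wts[:, 0].view(b, 1, 1, 1) * local + wts[:, 1].view(b, 1, 1, 1) * glob

block = LocalGlobalFusion(channels=32)
print(block(torch.randn(1, 32, 64, 64)).shape)
```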
Citations: 0
Endo-E2E-GS: End-to-end 3D reconstruction of endoscopic scenes using Gaussian Splatting
IF 3.4 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2026-01-13 | DOI: 10.1016/j.displa.2026.103353
Xiongzhi Wang, Boyu Yang, Min Wei, Yu Chen, Jingang Zhang, Yunfeng Nie
Three-dimensional (3D) reconstruction is essential for enhancing spatial perception and geometric understanding in minimally invasive surgery. However, current methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often rely on offline preprocessing—such as COLMAP-based point clouds or multi-frame fusion—limiting their adaptability and clinical deployment. We propose Endo-E2E-GS, a fully end-to-end framework that reconstructs structured 3D Gaussian fields directly from a single stereo endoscopic image pair. The system integrates (1) a DilatedResNet-based stereo depth estimator for robust geometry inference in low-texture scenes, (2) a Gaussian attribute predictor that infers per-pixel rotation, scale, and opacity, and (3) a differentiable splatting renderer for 2D view supervision. Evaluated on the ENDONERF and SCARED datasets, Endo-E2E-GS achieves highly competitive performance, reaching PSNR values of 38.874/33.052 and SSIM scores of 0.978/0.863, respectively, surpassing recent state-of-the-art approaches. It requires no explicit scene initialization and demonstrates consistent performance across two representative endoscopic datasets. Code is available at: https://github.com/Intelligent-Imaging-Center/Endo-E2E-GS.
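One concrete step in such a pipeline is turning the estimated stereo disparity into per-pixel 3D points that can seed Gaussian centres; the sketch below does exactly that with the standard depth-from-disparity and pinhole back-projection formulas. The intrinsics and baseline are made-up values, and the learned prediction of the remaining Gaussian attributes (rotation, scale, opacity) is not reproduced here.

```python
import numpy as np

def disparity_to_gaussian_means(disparity, fx, fy, cx, cy, baseline):
    """Back-project a stereo disparity map into per-pixel 3D points that can
    serve as candidate Gaussian centres for splatting-based reconstruction."""
    h, w = disparity.shape
    depth = fx * baseline / np.clip(disparity, 1e-6, None)   # z = f * b / d
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx                                 # pinhole back-projection
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)    # (H*W, 3) candidate means

# Toy usage with made-up intrinsics for a 256x320 endoscopic frame.
disp = np.random.uniform(5.0, 40.0, size=(256, 320))
means = disparity_to_gaussian_means(disp, fx=500.0, fy=500.0, cx=160.0, cy=128.0, baseline=0.005)
print(means.shape)
```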
Citations: 0