Design and evaluation of Avatar: An ultra-low-latency immersive human–machine interface for teleoperation
Pub Date: 2025-11-19 | DOI: 10.1016/j.displa.2025.103292 | Displays, Vol. 92, Article 103292
Junjie Li, Dewei Han, Jian Xu, Kang Li, Zhaoyuan Ma
Spatially separated teleoperation is crucial for inaccessible or hazardous scenarios but requires intuitive human–machine interfaces (HMIs) to ensure situational awareness, especially visual perception. While 360° panoramic vision offers immersion and a wide field of view, its high latency reduces efficiency and quality and causes motion sickness. This paper presents the Avatar system, an ultra-low-latency panoramic vision platform for teleoperation and telepresence. Measured with a convenient method, Avatar’s capture-to-display latency is only 220 ms. Two experiments with 43 participants demonstrated that Avatar achieves near-scene perception efficiency in near-field visual search. Its ultra-low latency also ensured high efficiency and quality in teleoperation tasks. Analysis of subjective questionnaires and physiological indicators confirmed that Avatar provides operators with intense immersion and presence. The system’s design and verification can guide the development of future universal, efficient HMIs for diverse applications.
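The abstract does not spell out the “convenient method” used for latency measurement; purely as an illustration, the Python sketch below follows a common glass-to-glass approach in which a running millisecond counter and the headset’s view of the streamed counter are photographed together and the reading differences are averaged. The function name and the sample readings are hypothetical, not taken from the paper.

```python
# Minimal sketch of a generic capture-to-display ("glass-to-glass") latency estimate.
# Assumption: paired readings of a source millisecond counter and the same counter as
# shown on the display, captured in one photo, so each difference is one latency sample.
from statistics import mean, median

def glass_to_glass_latency(readings_ms):
    """readings_ms: list of (source_counter_ms, displayed_counter_ms) pairs."""
    deltas = [src - shown for src, shown in readings_ms]
    return {"mean_ms": mean(deltas), "median_ms": median(deltas)}

# Hypothetical readings yielding roughly a 220 ms end-to-end latency.
print(glass_to_glass_latency([(10500, 10281), (10933, 10712), (11350, 11131),
                              (11764, 11545), (12181, 11960)]))
```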
{"title":"Design and evaluation of Avatar: An ultra-low-latency immersive human–machine interface for teleoperation","authors":"Junjie Li , Dewei Han , Jian Xu , Kang Li , Zhaoyuan Ma","doi":"10.1016/j.displa.2025.103292","DOIUrl":"10.1016/j.displa.2025.103292","url":null,"abstract":"<div><div>Spatially separated teleoperation is crucial for inaccessible or hazardous scenarios but requires intuitive human–machine interfaces (HMIs) to ensure situational awareness, especially visual perception. While 360°panoramic vision offers immersion and a wide field of view, its high latency reduces efficiency and quality and causes motion sickness. This paper presents the Avatar system, an ultra-low-latency panoramic vision platform for teleoperation and telepresence. Using a convenient method, Avatar’s measured capture-to-display latency is only 220 ms. Two experiments with 43 participants demonstrated that Avatar achieves near-scene perception efficiency in near-field visual search. Its ultra-low latency also ensured high efficiency and quality in teleoperation tasks. Analysis of subjective questionnaires and physiological indicators confirmed that Avatar provides operators with intense immersion and presence. The system’s design and verification guide future universal, efficient HMI development for diverse applications.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103292"},"PeriodicalIF":3.4,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An adaptive U-Net framework for dermatological lesion segmentation
Pub Date: 2025-11-17 | DOI: 10.1016/j.displa.2025.103290 | Displays, Vol. 92, Article 103290
Ru Huang, Zhimin Qian, Zhengbing Zhou, Zijian Chen, Jiannan Liu, Jing Han, Shuo Zhou, Jianhua He, Xiaoli Chu
With the deep integration of information technology, medical image segmentation has become a crucial tool for dermatological image analysis. However, existing dermatological lesion segmentation methods still face numerous challenges when dealing with complex lesion regions, resulting in limited segmentation accuracy. Therefore, this study presents an adaptive segmentation network that draws inspiration from U-Net’s symmetric architecture, with the goal of improving the precision and generalizability of dermatological lesion segmentation. The proposed Visual Scaled Mamba (VSM) module incorporates residual pathways and adaptive scaling factors to enhance fine-grained feature extraction and enable hierarchical representation learning. Additionally, we propose the Multi-Scaled Cross-Axial Attention (MSCA) mechanism, integrating multiscale spatial features and enhancing blurred boundary recognition through dual cross-axial attention. Furthermore, we design an Adaptive Wave-Dilated Bottleneck (AWDB), employing adaptive dilated convolutions and wavelet transforms to improve feature representation and long-range dependency modeling. Experimental results on the ISIC 2016, ISIC 2018, and PH2 public datasets show that our network achieves a good balance between model complexity and segmentation accuracy, leading to considerable performance gains in dermatological image segmentation.
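As a rough illustration of the “residual pathways and adaptive scaling factors” attributed to the VSM module, the PyTorch sketch below scales a residual branch by a learnable factor. The inner transform is a plain convolutional stand-in rather than an actual visual Mamba block, and all module names are assumptions for illustration only.

```python
# Sketch of a residual block whose branch output is modulated by a learned scaling factor.
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Sequential(            # placeholder for the visual Mamba branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.scale = nn.Parameter(torch.zeros(1))  # adaptive scaling factor, learned

    def forward(self, x):
        return x + self.scale * self.transform(x)  # residual pathway

x = torch.randn(1, 32, 64, 64)
print(ScaledResidualBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```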
{"title":"An adaptive U-Net framework for dermatological lesion segmentation","authors":"Ru Huang , Zhimin Qian , Zhengbing Zhou , Zijian Chen , Jiannan Liu , Jing Han , Shuo Zhou , Jianhua He , Xiaoli Chu","doi":"10.1016/j.displa.2025.103290","DOIUrl":"10.1016/j.displa.2025.103290","url":null,"abstract":"<div><div>With the deep integration of information technology, medical image segmentation has become a crucial tool for dermatological image analysis. However, existing dermatological lesion segmentation methods still face numerous challenges when dealing with complex lesion regions, which result in limited segmentation accuracy. Therefore, this study presents an adaptive segmentation network that draws inspiration from U-Net’s symmetric architecture, with the goal of improving the precision and generalizability of dermatological lesion segmentation. The proposed Visual Scaled Mamba (VSM) module incorporates residual pathways and adaptive scaling factors to enhance fine-grained feature extraction and enable hierarchical representation learning. Additionally, we propose the Multi-Scaled Cross-Axial Attention (MSCA) mechanism, integrating multiscale spatial features and enhancing blurred boundary recognition through dual cross-axial attention. Furthermore, we design an Adaptive Wave-Dilated Bottleneck (AWDB), employing adaptive dilated convolutions and wavelet transforms to improve feature representation and long-range dependency modeling. Through experimental results on the ISIC 2016, ISIC 2018, and PH2 public datasets show that our network achieves a good compromise between model complexity and segmentation accuracy, leading to considerable performance increases in dermatological image segmentation.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103290"},"PeriodicalIF":3.4,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Texture generation and adaptive fusion networks for image inpainting
Pub Date: 2025-11-17 | DOI: 10.1016/j.displa.2025.103287 | Displays, Vol. 92, Article 103287
Wuzhen Shi, Wu Yang, Yang Wen
Image inpainting aims to reconstruct missing regions in images with visually realistic and semantically consistent content. Existing deep learning-based methods often rely on structural priors to guide the inpainting process, but these priors provide limited information for texture recovery, leading to blurred or inconsistent details. To address this issue, we propose a Texture Generation and Adaptive Fusion Network (TGAFNet) that explicitly models texture priors to enhance high-frequency texture generation and adaptive fusion. TGAFNet consists of two branches: a main branch for coarse image generation and refinement, and a texture branch for explicit texture synthesis. The texture branch exploits both contextual cues and multi-level features from the main branch to generate sharp texture maps under the guidance of adversarial training with SN-PatchGAN. Furthermore, a Texture Patch Adaptive Fusion (TPAF) module is introduced to perform patch-to-patch matching and adaptive fusion, effectively handling cross-domain misalignment between the generated texture and coarse images. Extensive experiments on multiple benchmark datasets demonstrate that TGAFNet achieves state-of-the-art performance, generating visually realistic and fine-textured results. The findings highlight the effectiveness of explicit texture priors and adaptive fusion mechanisms for high-fidelity image inpainting, offering a promising direction for future image restoration research.
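For intuition about adaptively fusing a generated texture map with a coarse result, here is a deliberately simplified PyTorch sketch using a learned per-pixel gate. The actual TPAF module additionally performs patch-to-patch matching, which is omitted here, and the module name is hypothetical.

```python
# Simplified gated fusion of a texture branch output with a coarse inpainting result.
import torch
import torch.nn as nn

class GatedTextureFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),  # per-pixel, per-channel fusion weights in [0, 1]
        )

    def forward(self, coarse, texture):
        w = self.gate(torch.cat([coarse, texture], dim=1))
        return w * texture + (1.0 - w) * coarse  # lean on texture where the gate is high

coarse, texture = torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64)
print(GatedTextureFusion(16)(coarse, texture).shape)
```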
{"title":"Texture generation and adaptive fusion networks for image inpainting","authors":"Wuzhen Shi, Wu Yang, Yang Wen","doi":"10.1016/j.displa.2025.103287","DOIUrl":"10.1016/j.displa.2025.103287","url":null,"abstract":"<div><div>Image inpainting aims to reconstruct missing regions in images with visually realistic and semantically consistent content. Existing deep learning-based methods often rely on structural priors to guide the inpainting process, but these priors provide limited information for texture recovery, leading to blurred or inconsistent details. To address this issue, we propose a Texture Generation and Adaptive Fusion Network (TGAFNet) that explicitly models texture priors to enhance high-frequency texture generation and adaptive fusion. TGAFNet consists of two branches: a main branch for coarse image generation and refinement, and a texture branch for explicit texture synthesis. The texture branch exploits both contextual cues and multi-level features from the main branch to generate sharp texture maps under the guidance of adversarial training with SN-PatchGAN. Furthermore, a Texture Patch Adaptive Fusion (TPAF) module is introduced to perform patch-to-patch matching and adaptive fusion, effectively handling cross-domain misalignment between the generated texture and coarse images. Extensive experiments on multiple benchmark datasets demonstrate that TGAFNet achieves state-of-the-art performance, generating visually realistic and fine-textured results. The findings highlight the effectiveness of explicit texture priors and adaptive fusion mechanisms for high-fidelity image inpainting, offering a promising direction for future image restoration research.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103287"},"PeriodicalIF":3.4,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Teacher–student adversarial YOLO for domain adaptive detection in traffic scenes under adverse weather
Pub Date: 2025-11-17 | DOI: 10.1016/j.displa.2025.103289 | Displays, Vol. 92, Article 103289
Xuejuan Han, Zhong Qu, Shufang Xia
The problem of difficult traffic object localization under adverse weather remains unsolved because collecting and labeling large-scale data is labor-intensive and time-consuming. Domain adaptive object detection (DAOD) can achieve cross-domain detection without labeling; however, most existing DAOD methods are based on the two-stage Faster R-CNN and leave room for improvement in both accuracy and speed. We propose a DAOD method, TSA-YOLO, which takes full advantage of adversarial learning and pseudo-labeling to achieve high-performance cross-domain detection for fog, rain, and low-light scenes. For the input images, we generate auxiliary-domain images with CycleGAN and design strong and weak augmentation schemes to reduce the bias between the teacher and student models. Additionally, in the student self-learning module, we propose a pixel-level domain discriminator to better extract domain-invariant features, effectively narrowing the feature distribution gap between the source and target domains. In the teacher–student mutual learning module, we incorporate the mean teacher (MT) model, iteratively updating parameters to generate high-quality pseudo-labels. We evaluate our method on the public datasets Foggy Cityscapes, Rain Cityscapes, and BDD100k_Dark. The results show that TSA-YOLO significantly improves detection performance: compared with the baseline, it achieves up to a 15.0% increase in mAP@0.5 on Foggy Cityscapes and up to an 18.5% increase on Rain Cityscapes, while converging in only 50 epochs and without reducing the model’s inference speed.
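Two ingredients named in the abstract, the mean-teacher update and adversarial domain alignment, can be sketched generically in PyTorch as an exponential-moving-average (EMA) parameter update and a gradient-reversal layer placed before a domain discriminator. The momentum value and the use of gradient reversal (rather than another adversarial scheme) are assumptions, not details taken from the paper.

```python
# Generic sketches: gradient reversal for domain-adversarial feature learning,
# and an EMA "mean teacher" update of teacher weights from student weights.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip gradients so features become domain-invariant

def grad_reverse(x, lam=1.0):
    # Usage sketch: reversed = grad_reverse(backbone_features); logits = domain_discriminator(reversed)
    return GradReverse.apply(x, lam)

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """Teacher weights track an exponential moving average of the student weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

student = nn.Conv2d(3, 8, 3)
teacher = nn.Conv2d(3, 8, 3)
teacher.load_state_dict(student.state_dict())
ema_update(teacher, student)
```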
{"title":"Teacher–student adversarial YOLO for domain adaptive detection in traffic scenes under adverse weather","authors":"Xuejuan Han , Zhong Qu , Shufang Xia","doi":"10.1016/j.displa.2025.103289","DOIUrl":"10.1016/j.displa.2025.103289","url":null,"abstract":"<div><div>The problem of difficult traffic object localization under adverse weather has not been solved due to the labor-intensive and time-consuming process of collecting and labeling large-scale data. Domain adaptive object detection (DAOD) can achieve cross-domain detection without labeling, however, most of the existing DAOD methods are based on two-stage Faster R-CNN, which needs to be improved in both accuracy and speed. We propose a DAOD method TSA-YOLO, which takes full advantage of adversarial learning and pseudo-labeling to achieve high-performance cross-domain detection for fog, rain, and low-light scenes. For the input images, we generate auxiliary domain images by CycleGAN, also design a strong and weak enhancement method to reduce the bias of the teacher and student models. Additionally, in the student self-learning module, we propose a pixel-level domain discriminator to better extract domain-invariant features, effectively narrowing the feature distribution gap between the source and target domains. In the teacher–student mutual learning module, we incorporate the mean teacher (MT) model, iteratively update parameters to generate high-quality pseudo-labels. In addition, we evaluate our method on the public datasets Foggy Cityscapes, Rain Cityscapes, and BDD100k_Dark. The results show that TSA-YOLO significantly improves detection performance. Specifically, compared with the baseline, TSA-YOLO achieves up to a 15.0% increase in <em>[email protected]</em> on Foggy Cityscapes and up to an 18.5% increase on Rain Cityscapes, while converging in only 50 epochs and without reducing the model’s inference speed.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103289"},"PeriodicalIF":3.4,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual speaker authentication via lip motions: Appearance consistency and semantic disentanglement
Pub Date: 2025-11-17 | DOI: 10.1016/j.displa.2025.103288 | Displays, Vol. 92, Article 103288
Dawei Luo, Dongliang Xie, Wanpeng Xie
Lip-based visual biometric technology shows significant potential for improving the security of identity authentication in human–computer interaction. However, variations in lip contours and the entanglement of dynamic and semantic features limit its performance. To tackle these challenges, we revisit the personalized characteristics of lip-motion signals and propose a lip-based authentication framework built on personalized feature modeling. Specifically, the framework adopts a “shallow 3D CNN + deep 2D CNN” architecture to extract dynamic lip appearance features during speech, and introduces an appearance consistency loss to capture spatially invariant features across frames. For dynamic features, a semantic decoupling strategy is proposed to force the model to learn lip-motion patterns that are independent of semantic content. Additionally, we design a dynamic password authentication method based on visual speech recognition (VSR) to enhance system security. In our approach, appearance and motion patterns are used for speaker verification, while VSR results are used for passphrase verification, with the two working jointly. Experiments on the ICSLR and GRID datasets show that our method achieves excellent performance in terms of authentication accuracy and robustness, highlighting its potential in secure human–computer interaction scenarios. The code is made publicly available at https://github.com/Davi32ML/VSALip.
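The exact form of the appearance consistency loss is not given in the abstract; one plausible, minimal PyTorch formulation penalizes how far each frame’s appearance embedding drifts from the clip-level mean, as sketched below. The tensor shapes and the mean-squared-error choice are assumptions for illustration.

```python
# Hedged sketch: encourage per-frame appearance embeddings to stay close to their
# temporal mean, i.e., to be spatially invariant across frames of one utterance.
import torch

def appearance_consistency_loss(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """frame_embeddings: (batch, frames, dim) appearance features per frame."""
    clip_mean = frame_embeddings.mean(dim=1, keepdim=True)   # (batch, 1, dim)
    return ((frame_embeddings - clip_mean) ** 2).mean()      # MSE to the temporal mean

emb = torch.randn(4, 29, 256, requires_grad=True)  # e.g. 29 frames per utterance
loss = appearance_consistency_loss(emb)
loss.backward()
```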
{"title":"Visual speaker authentication via lip motions: Appearance consistency and semantic disentanglement","authors":"Dawei Luo, Dongliang Xie, Wanpeng Xie","doi":"10.1016/j.displa.2025.103288","DOIUrl":"10.1016/j.displa.2025.103288","url":null,"abstract":"<div><div>Lip-based visual biometric technology shows significant potential for improving the security of identity authentication in human–computer interaction. However, variations in lip contours and the entanglement of dynamic and semantic features limit its performance. To tackle these challenges, we revisit the personalized characteristics in lip-motion signals and propose a lip-based authentication framework based on personalized feature modeling. Specifically, the framework adopts a “shallow 3D CNN + deep 2D CNN” architecture to extract dynamic lip appearance features during speech, and introduces an appearance consistency loss to capture spatially invariant features across frames. For dynamic features, a semantic decoupling strategy is proposed to force the model to learn lip-motion patterns that are independent of semantic content. Additionally, we design a dynamic password authentication method based on visual speech recognition (VSR) to enhance system security. In our approach, appearance and motion patterns are used for speaker verification, while VSR results are used for passphrase verification — they working jointly. Experiments on the ICSLR and GRID datasets show that our method achieves excellent performance in terms of authentication accuracy and robustness, highlighting its potential in secure human–computer interaction scenarios. The code is made publicly available at <span><span>https://github.com/Davi32ML/VSALip</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103288"},"PeriodicalIF":3.4,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Progressive multi-level learning for gloss-free sign language translation
Pub Date: 2025-11-14 | DOI: 10.1016/j.displa.2025.103285 | Displays, Vol. 92, Article 103285
Yingchun Xie, Wei Su, Shukai Chen, Jinzhao Wu, Chuan Cai, Yongna Yuan
Gloss-free sign language translation is a key focus of sign language translation research, enabling effective communication between deaf and hearing individuals in a broader and more universal manner. In this work, we propose a Progressive Multi-Level Learning model for sign language translation (PML-SLT), which progressively learns sign representations to improve video understanding. Rather than requiring every frame to attend to all other frames during attention computation, our approach introduces a progressive perceptual field expansion mechanism that gradually broadens the attention scope across video frames. This mechanism continuously expands the perceptual field between frames, effectively capturing both local and global information. In addition, to fully exploit multi-granularity information, we employ a multi-level feature integration scheme that transfers the output of each encoder layer to the corresponding decoder layer, enabling comprehensive utilization of hierarchical temporal features. We also introduce a multi-modal triplet loss to harmonize semantic information across modalities, aligning the text space with the video space so that the video features acquire richer semantic meaning. Experimental results on two public datasets demonstrate the promising translation performance of the proposed PML-SLT model.
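To make the idea of a progressively expanding perceptual field concrete, the sketch below builds a banded attention mask whose window widens with network depth, so deeper layers see more of the video. The specific window schedule is an illustrative assumption, not the paper’s actual expansion rule.

```python
# Banded attention mask: each frame may attend only to frames within +/- window positions.
import torch

def banded_attention_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean mask (True = attend) restricting attention to a temporal window."""
    idx = torch.arange(num_frames)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Assumed schedule: deeper layers use progressively wider windows.
for layer, window in enumerate([2, 4, 8, 16]):
    mask = banded_attention_mask(num_frames=32, window=window)
    print(f"layer {layer}: frame 16 attends to {mask[16].sum().item()} frames")
```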
{"title":"Progressive multi-level learning for gloss-free sign language translation","authors":"Yingchun Xie , Wei Su , Shukai Chen , Jinzhao Wu , Chuan Cai , Yongna Yuan","doi":"10.1016/j.displa.2025.103285","DOIUrl":"10.1016/j.displa.2025.103285","url":null,"abstract":"<div><div>Gloss-free sign language translation is a key focus in sign language translation research, enabling effective communication between the deaf and the hearing individuals in a broader and more universal manner. In this work, we propose a Progressive Multi-Level Learning model for sign language translation (PML-SLT), which progressively learns sign representations to improve video understanding. Rather than requiring every frame to attend to all other frames during attention computation, our approach introduces a progressive perceptual field expansion mechanism that gradually broadens the attention scope across video frames. This mechanism continuously expands the perceptual field between frames, effectively capturing both local and global information. Besides, to fully exploit multi-granularity information, we employ a multi-level feature integration scheme that transfers the output of each encoder layer to the corresponding decoder layer, enabling comprehensive utilization of hierarchical temporal features. Additionally, we introduce a multi-modal triplet loss to harmonize semantic information across modalities, aligning the text space with the video space so that the video features acquire richer semantic meaning. Experimental results on two public datasets demonstrate the promising translation performance of the proposed PML-SLT model.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103285"},"PeriodicalIF":3.4,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinspired micro-/nano-composite structures for simultaneous enhancement of light extraction efficiency and output uniformity in Micro-LEDs
Pub Date: 2025-11-13 | DOI: 10.1016/j.displa.2025.103286 | Displays, Vol. 92, Article 103286
Jingyu Liu, Jiawei Zhang, Zhenyou Zou, Yibin Lin, Jinyu Ye, Wenfu Huang, Chaoxing Wu, Yongai Zhang, Jie Sun, Qun Yan, Xiongtu Zhou
The strong total internal reflection (TIR) in micro light-emitting diodes (Micro-LEDs) significantly limits light extraction efficiency (LEE) and uniformity of light distribution, thereby hindering their industrial applications. Inspired by the layered surface structures found in firefly lanterns, this study proposes a flexible bioinspired micro-/nano-composite structure that effectively enhances both LEE and the uniformity of light output. Finite-Difference Time-Domain (FDTD) simulations demonstrate that microstructures contribute to directional light extraction, whereas nanostructures facilitate overall optical optimization. A novel fabrication approach integrating grayscale photolithography, mechanical stretching, and plasma treatment was developed, enabling the realization of micro-/nano-composite structures with tunable design parameters. Experimental results indicate a 40.5% increase in external quantum efficiency (EQE) and a 41.6% improvement in power efficiency (PE) for blue Micro-LEDs, accompanied by enhanced angular light distribution, leading to wider viewing angles and near-ideal light uniformity. This advancement effectively resolves the longstanding challenge of balancing efficiency and uniformity in light extraction, thereby facilitating the industrialization of Micro-LED technology.
{"title":"Bioinspired micro-/nano-composite structures for simultaneous enhancement of light extraction efficiency and output uniformity in Micro-LEDs","authors":"Jingyu Liu , Jiawei Zhang , Zhenyou Zou , Yibin Lin , Jinyu Ye , Wenfu Huang , Chaoxing Wu , Yongai Zhang , Jie Sun , Qun Yan , Xiongtu Zhou","doi":"10.1016/j.displa.2025.103286","DOIUrl":"10.1016/j.displa.2025.103286","url":null,"abstract":"<div><div>The strong total internal reflection (TIR) in micro light-emitting diodes (Micro-LEDs) significantly limits light extraction efficiency (LEE) and uniformity of light distribution, thereby hindering their industrial applications. Inspired by the layered surface structures found in firefly lanterns, this study proposes a flexible bioinspired micro-/nano-composite structure that effectively enhances both LEE and the uniformity of light output. Finite-Difference Time-Domain (FDTD) simulations demonstrate that microstructures contribute to directional light extraction, whereas nanostructures facilitate overall optical optimization. A novel fabrication approach integrating grayscale photolithography, mechanical stretching, and plasma treatment was developed, enabling the realization of micro-/nano-composite structures with tunable design parameters. Experimental results indicate a 40.5% increase in external quantum efficiency (EQE) and a 41.6% improvement in power efficiency (PE) for blue Micro-LEDs, accompanied by enhanced angular light distribution, leading to wider viewing angles and near-ideal light uniformity. This advancement effectively resolves the longstanding challenge of balancing efficiency and uniformity in light extraction, thereby facilitating the industrialization of Micro-LED technology.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103286"},"PeriodicalIF":3.4,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measuring points for video subjective assessment – Impact of memory and stimulus variability
Pub Date: 2025-11-12 | DOI: 10.1016/j.displa.2025.103283 | Displays, Vol. 92, Article 103283
Tomasz Konaszyński, Avrajyoti Dutta, Burak Gizlice, Dawid Juszka, Mikołaj Leszczuk
The work describes a QoE experiment assessing the impact of memory and stimulus variability on subjective assessments of 2D videos, as well as an attempt to identify dominant points, i.e., moments in time or events that influence the overall assessment of the changing quality of the evaluated videos. The results of the QoE experiment clearly demonstrated the impact of varying video quality on the subjective assessment of 2D videos, both in terms of results eligibility and subjective ratings.
The concept of “measurement points” was introduced, i.e., points in time or events associated with the highest impact on subjective rating values when variable-quality videos are assessed or videos are displayed in a varying controlled environment.
The relationship between the memory of particular aspects of the video presentation, including the memory of subsequent appearances of a given video, and the values obtained from the assessment results was also demonstrated. Several regularities were observed, including a very strong negative effect of the variability of the rated videos’ technical quality on results eligibility; an effect of boredom/annoyance from watching a longer video of variable quality; a “last impression effect”, i.e., videos whose quality increases over time achieve higher MOS values than videos whose quality decreases over time; and better assessments of “fresh” observations compared with subsequent ones.
{"title":"Measuring points for video subjective assessment – Impact of memory and stimulus variability","authors":"Tomasz Konaszyński , Avrajyoti Dutta , Burak Gizlice , Dawid Juszka , Mikołaj Leszczuk","doi":"10.1016/j.displa.2025.103283","DOIUrl":"10.1016/j.displa.2025.103283","url":null,"abstract":"<div><div>The work describes a QoE experiment concerning the assessment of the impact of memory and stimulus variability on subjective assessments of 2D videos, as well as an attempt to identify dominant points − moments in time or events influencing the overall assessment of the changing quality of the assessed films. Based on the results of the conducted QoE experiment, the impact of varying video quality on subjective assessment of 2D videos was clearly demonstrated, both in terms of results eligibility and subjective ratings.</div><div>The concept of “measurement points” was introduced, i.e., points in time or events that were associated with the highest impact on the values of subjective ratings when variable quality videos are assessed or videos are displayed in variable controlled environment.</div><div>The relationship between the memory of particular aspects of the video presentation, including the memory of subsequent appearances of the given video, and the values obtained from the assessment results were also demonstrated. There were observed regularities, including very strong negative effect of the variability of the technical quality of the rated videos on results eligibility, effect of boredom/annoyance from watching a longer video of variable quality, “last impression effect”, i.e. videos whose changing quality increases over time achieve higher MOS values than videos whose quality decreases over time and better assessments of “fresh” observations in comparison to the following ones.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103283"},"PeriodicalIF":3.4,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MASGC: Hybrid attention and synchronous graph learning for monocular 3D pose estimation
Pub Date: 2025-11-12 | DOI: 10.1016/j.displa.2025.103284 | Displays, Vol. 92, Article 103284
Shengjie Li, Jin Wang, Jianwei Niu, Yuanhang Wang, Haiyun Zhang, Guodong Lu, Jingru Yang, Xiaolong Yu, Renluan Hou
Occlusion and depth ambiguity pose significant challenges to the accuracy of monocular 3D human pose estimation. To tackle these issues, this study presents a two-stage pose estimation method based on Multi-Attention and Synchronous-Graph-Convolution (MASGC). In the first stage (2D pose estimation), a feature pyramid convolutional attention (FPCA) module is designed based on a multiresolution feature pyramid (MFP) and a convolutional attention triplet (CAT), which integrates channel, coordinate, and spatial attention, enabling the model to focus on the most salient features and mitigate the location information loss caused by global pooling, thereby improving estimation accuracy. In the second stage (lifting to 3D), a temporal synchronous graph convolutional network (TSGCN) is designed. By incorporating multi-head attention and expanding the receptive field of end keypoints through topological temporal convolutions, TSGCN effectively addresses the challenges of occlusion and depth ambiguity in monocular 3D human pose estimation. Experimental results show that MASGC outperforms the compared baseline methods on benchmark datasets, including Human3.6M and a custom dual-arm dataset, while reducing computational complexity compared to mainstream models. The code is available at https://github.com/JasonLi-30/MASGC.
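As a simplified illustration of chaining attention mechanisms in the spirit of the convolutional attention triplet (CAT), the PyTorch sketch below applies channel attention followed by spatial attention; the coordinate-attention stage is omitted for brevity, and this is not the paper’s exact FPCA design.

```python
# Channel attention (pooled MLP gate) followed by spatial attention (7x7 conv gate).
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention from globally average-pooled features
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # spatial attention from channel-wise average and max maps
        sa_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(sa_in))

print(ChannelSpatialAttention(32)(torch.randn(2, 32, 56, 56)).shape)
```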
{"title":"MASGC: Hybrid attention and synchronous graph learning for monocular 3D pose estimation","authors":"Shengjie Li , Jin Wang , Jianwei Niu , Yuanhang Wang , Haiyun Zhang , Guodong Lu , Jingru Yang , Xiaolong Yu , Renluan Hou","doi":"10.1016/j.displa.2025.103284","DOIUrl":"10.1016/j.displa.2025.103284","url":null,"abstract":"<div><div>Occlusion and depth ambiguity pose significant challenges to the accuracy of monocular 3D human pose estimation. To tackle these issues, this study presents a two-stage pose estimation method based on Multi-Attention and Synchronous-Graph-Convolution (MASGC). In the first stage (2D pose estimation), a feature pyramid convolutional attention (FPCA) module is designed based on a multiresolution feature pyramid (MFP) and a convolutional attention triplet (CAT), which integrates channel, coordinate, and spatial attention, enabling the model to focus on the most salient features and mitigate location information loss caused by global pooling, thereby improving estimation accuracy. In the second stage (lifting to 3D), a temporal synchronous graph convolutional network (TSGCN) is designed. By incorporating multi-head attention and expanding the receptive field of end keypoints through topological temporal convolutions, TSGCN effectively addresses the challenges of occlusion and depth ambiguity in monocular 3D human pose estimation. Experimental results show that MASGC outperforms the compared baseline methods on benchmark datasets, including Human3.6 M and a custom dual-arm dataset, while reducing computational complexity compared to mainstream models. The code is available at <span><span>https://github.com/JasonLi-30/MASGC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103284"},"PeriodicalIF":3.4,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145580487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision language model based panel digit recognition for medical screen data acquisition
Pub Date: 2025-11-11 | DOI: 10.1016/j.displa.2025.103282 | Displays, Vol. 92, Article 103282
Yizhi Zou, Shuangjie Yuan, Haoyu Liu, Xu Cheng, Tao Zhu, Lu Yang
In the cardiac operating room, physicians interpret patients’ vital signs from medical equipment to make critical decisions, such as administering blood transfusions. However, the absence of automated data acquisition in these devices significantly complicates the documentation of surgical information. Existing text recognition methods are often limited to single applications, lack broad generalization capabilities, and have inconsistent detection and recognition times. We present a novel medical device screen recognition framework based on pretrained Vision Language Models (VLMs). The VLM-based structure significantly enhances flexibility across application scenarios, and multi-round dialogue makes interaction more natural, allowing a better understanding of the surgeon’s needs. Because screen images acquired by a head-mounted camera can be unclear, we propose Medical Screen Data Acquisition-VLM (MSDA-VLM), which adds a pre-filtering module to detect image blur. This module detects heavy ghost images via Local Binary Pattern (LBP) based texture block matching and assesses image sharpness through the variance of deep feature maps. Trained on thousands of screen images on top of the pretrained VLM, the framework achieves a 17.07% improvement in precision and a 17.05% improvement in recall. Furthermore, the experimental results demonstrate notable enhancements in medical screen data recognition.
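A pre-filter of the kind described can be sketched with off-the-shelf tools: LBP histograms for texture block matching (to flag ghosting) and a variance-based sharpness score. Note that this sketch substitutes the variance of the Laplacian for the paper’s variance of deep feature maps, and all function names and thresholds are illustrative assumptions.

```python
# Hedged sketch of a blur/ghosting pre-filter using LBP texture histograms and a
# Laplacian-variance sharpness check (a stand-in for deep-feature-map variance).
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def lbp_hist(block, P=8, R=1.0):
    lbp = local_binary_pattern(block, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def looks_ghosted(block_a, block_b, sim_thresh=0.9):
    """Two spatially offset blocks with near-identical LBP texture suggest ghosting."""
    a, b = lbp_hist(block_a), lbp_hist(block_b)
    cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return cos_sim > sim_thresh

def looks_blurred(gray, var_thresh=100.0):
    """Low variance of the Laplacian indicates a lack of sharp edges."""
    return cv2.Laplacian(gray, cv2.CV_64F).var() < var_thresh
```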
{"title":"Vision language model based panel digit recognition for medical screen data acquisition","authors":"Yizhi Zou , Shuangjie Yuan , Haoyu Liu , Xu Cheng , Tao Zhu , Lu Yang","doi":"10.1016/j.displa.2025.103282","DOIUrl":"10.1016/j.displa.2025.103282","url":null,"abstract":"<div><div>In the cardiac operating room, physicians interpret patients’ vital signs from medical equipment to make critical decisions, such as administering blood transfusions. However, the absence of automated data acquisition in these devices significantly complicates the documentation of surgical information. Existing text recognition methods are often limited to single applications and lack broad generalization capabilities, with inconsistent detection and recognition times. We present a novel medical device screen recognition framework based on pretrained Vision Language Models (VLMs). The structure based on the vision language model significantly enhances the flexibility of the application scenario, and the multi-round dialogue is more humanized, allowing for a better understanding of the surgeon’s needs. Considering the existence of unclear image information acquired by a head-mounted camera, for the acquisition of screen data, we propose Medical Screen Data Acquisition-VLM (MSDA-VLM) with a pre-filtering module to detect image blur. This module detects heavy ghost images via Local Binary Pattern (LBP) based texture block matching and assesses image sharpness through the variance of deep feature maps. Trained by thousands of screen images on the pretrained VLM, we achieve a 17.07% improvement in precision and a 17.05% improvement in recall. Furthermore, the experiment results demonstrate notable enhancements in medical screen data recognition.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103282"},"PeriodicalIF":3.4,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145529257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}