Pub Date : 2024-08-10 | DOI: 10.1016/j.displa.2024.102809
Wenqian Xu , Peiyu Wu , Qi Yao , Rongjun Zhang , Bang Qin , Dong Wang , Shenfei Chen , Yedong Shen
Long-term use of desktop displays may increase the burden on the visual system, and users can enable a low blue light mode to protect their eyes in terms of the circadian effect. In this work, we investigated the influence of brightness from two aspects: visual effect, namely efficacy and circadian effect, and color quality, namely the color difference Δu’v’ (the chromaticity-coordinate offset between two colors) and Duv (the deviation from the blackbody locus). Decreasing brightness increases efficacy while diminishing the circadian effect. Blue, cyan, and magenta show the largest Δu’v’, and the lower the saturation, the greater the Δu’v’. The lower the correlated color temperature (CCT), the greater the Duv and the farther the chromaticity deviates from the Planckian locus. We summarize three low blue light mode adjustment strategies based on the red, green, and blue channel ratios of the spectrum, and propose an optimized mode, derived with a genetic algorithm, that offers two optional CCT ranges of 3500–5000 K and 2700–3000 K. Furthermore, we establish the relationship between brightness and gamut coverage to refine the screen brightness range for low blue light mode. This research provides valuable insights into low blue light mode applications and their implications for human-centric healthy displays.
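The Δu’v’ used above is the Euclidean distance between two points in the CIE 1976 u’v’ chromaticity diagram. A minimal sketch; the example coordinates are illustrative, not measurements from the paper:

```python
import math

def delta_uv_prime(uv1, uv2):
    """Euclidean distance between two CIE 1976 (u', v') chromaticity points."""
    return math.hypot(uv1[0] - uv2[0], uv1[1] - uv2[1])

# Illustrative only: D65 white vs. a slightly warmer white point
d65 = (0.1978, 0.4683)
warm = (0.2200, 0.4800)
print(round(delta_uv_prime(d65, warm), 4))  # ~0.025
```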
Title: Evaluation and application strategy of low blue light mode of desktop display based on brightness characteristics (Displays, vol. 84, Article 102809)
Pub Date : 2024-08-08 | DOI: 10.1016/j.displa.2024.102805
Chen Cheng, Huahu Xu
Human pose estimation remains an active research topic. Previous algorithms based on traditional machine learning struggle with feature extraction and suffer from low fusion efficiency. To address these problems, we propose a Transformer-based method that combines three techniques to capture the human pose: a Transformer-based feature extraction module, a multi-scale feature fusion module, and an occlusion processing mechanism. The Transformer-based feature extraction module uses the self-attention mechanism to extract key features from the input sequence; the multi-scale feature fusion module fuses feature information at different scales to enhance the perception ability of the model; and the occlusion processing mechanism handles occlusion in the data and effectively removes background interference. Our method shows excellent performance on the standard Human3.6M dataset and an in-the-wild video dataset, achieving accurate pose prediction for both complex actions and challenging samples.
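The self-attention step the abstract refers to can be illustrated with a minimal single-head scaled dot-product attention in NumPy; note this omits the learned Q/K/V projections and multi-head structure a real Transformer would use:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention, without learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ x                                   # each token becomes a weighted mix

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 tokens, dim 2
out = self_attention(tokens)
print(out.shape)  # (3, 2)
```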
Title: Human pose estimation in complex background videos via Transformer-based multi-scale feature integration (Displays, vol. 84, Article 102805)
Pub Date : 2024-08-08 | DOI: 10.1016/j.displa.2024.102808
Yu Jin , Ying Shen , Wen-Jie Xu , Wen-Zhi Fan , Lei Xu , Xiao-Yu Gao , Yong Wu , Zhi-Yi Zhou , Wei-Jie Gu , Dong-Liang Yu , Jian-Qiu Sun , Li-Juan Ke , Wei-Bin Zhang , Wei-Qi Xu , Feng-Ying Xu
In this work, we systematically investigate low-temperature polycrystalline silicon (LTPS)-based driving circuits for electronic paper, with the aim of adopting a small width-to-length ratio (W/L) for LTPS-based thin-film transistors (TFTs) to reduce switch error and thus mitigate image sticking. First, LTPS-TFTs with extremely low off-state leakage current (IOFF), even at a large source–drain voltage (VDS) of 30 V, were obtained through detailed exploration of the LTPS process technology. Meanwhile, the high on-state current (ION) of the LTPS-TFTs also meets the requirement of fast signal writing to the storage capacitor, owing to their extremely high field-effect mobility (approximately 100 cm2/V·s); this makes it possible to fabricate TFTs with relatively small W/L, thereby minimizing switch error. The ID–VD test results reveal that the produced LTPS-TFTs can effectively withstand the maximum voltage difference of 30 V during product operation. Subsequently, the optimal W/L of the LTPS-TFT was determined experimentally. Reliability tests on the obtained LTPS-TFTs show that their threshold voltage (VTH) shifted by only 0.08 V after 7200 s under negative bias temperature stress (NBTS) and by 0.19 V under positive bias temperature stress (PBTS). The aging test results exhibit a new physical phenomenon: the IOFF of the LTPS-TFTs is strictly matched to the aging direction. Next, we propose a novel 2T2C driving circuit for the e-paper, which effectively avoids the adverse effects of IOFF during the frame holding period, and lay it out as an array. Finally, we combine the optimal LTPS-TFT fabrication process with the 2T2C driving circuit design to produce an e-paper with outstanding image-sticking performance.
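The trade-off described here (low IOFF to limit voltage droop on the storage capacitor during the long frame-hold period, high ION to keep write times short) can be sanity-checked with back-of-envelope arithmetic. All component values below are assumptions for illustration, not figures from the paper:

```python
# Illustrative pixel-circuit arithmetic; every value here is an assumption.
C_ST = 1e-12    # storage capacitance: 1 pF
I_OFF = 1e-13   # off-state leakage: 0.1 pA
I_ON = 1e-5     # on-state current: 10 uA
V_DATA = 30.0   # data voltage swing (V), matching the 30 V operating range
T_HOLD = 1.0    # frame-hold time (s), long for e-paper

droop = I_OFF * T_HOLD / C_ST     # voltage lost to leakage while holding
t_write = C_ST * V_DATA / I_ON    # time to charge the capacitor through the TFT
print(f"hold droop = {droop:.2f} V, write time = {t_write * 1e6:.1f} us")
```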
Title: Development of low-temperature polycrystalline silicon process and novel 2T2C driving circuits for electric paper (Displays, vol. 84, Article 102808)
Pub Date : 2024-07-31 | DOI: 10.1016/j.displa.2024.102806
Xingdong Sheng , Shijie Mao , Yichao Yan , Xiaokang Yang
Augmented Reality (AR) has gained significant attention in recent years as a technology that enhances the user’s perception of, and interaction with, the real world by overlaying virtual objects. Simultaneous Localization and Mapping (SLAM) algorithms play a crucial role in enabling AR applications by allowing a device to determine its position and orientation in the real world while mapping the environment. This paper first summarizes recent AR products and SLAM algorithms, and presents a comprehensive overview of SLAM approaches, including feature-based, direct, and deep learning-based methods, highlighting their advantages and limitations. It then provides an in-depth exploration of classical SLAM algorithms for AR, with a focus on visual SLAM and visual–inertial SLAM. Lastly, sensor configurations, datasets, and performance evaluation for AR SLAM are also discussed. The review concludes with a summary of the current state of SLAM algorithms for AR and provides insights into future directions for research and development in this field. Overall, this review serves as a valuable resource for researchers and engineers interested in understanding the advancements and challenges of SLAM algorithms for AR.
Title: Review on SLAM algorithms for Augmented Reality (Displays, vol. 84, Article 102806)

Pub Date : 2024-07-29 | DOI: 10.1016/j.displa.2024.102803
Light field (LF) images offer abundant spatial and angular information, and combining the two benefits LF image super-resolution (LF image SR). Existing methods often decompose the 4D LF data into low-dimensional subspaces for individual feature extraction and fusion. However, their performance is restricted because they lack effective correlations between subspaces and miss crucial complementary information for capturing rich texture details. To address this, we propose a cross-subspace fusion network for LF spatial SR (CSFNet). Specifically, we design a progressive cross-subspace fusion module (PCSFM), which progressively establishes cross-subspace correlations based on a cross-attention mechanism to comprehensively enrich LF information. Additionally, we propose a high-resolution adaptive enhancement group (HR-AEG), which preserves texture and edge details in the high-resolution feature domain by employing a multibranch enhancement method and an adaptive weight strategy. The experimental results demonstrate that our approach achieves highly competitive performance on multiple LF datasets compared to state-of-the-art (SOTA) methods.
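The 4D structure being decomposed can be pictured directly: fixing the angular coordinates of a light field L(u, v, s, t) yields sub-aperture images, while mixed angular–spatial slices yield epipolar-plane images. A small NumPy illustration; the array sizes are arbitrary:

```python
import numpy as np

# A 4D light field L(u, v, s, t): (u, v) index angular views, (s, t) spatial pixels.
U, V, S, T = 5, 5, 32, 48
lf = np.random.rand(U, V, S, T)

sai = lf[2, 2]                   # sub-aperture image: fix the view -> 2D image (32, 48)
epi = lf[:, 2, 16, :]            # epipolar-plane image: fix v and s -> slice (5, 48)
views = lf.reshape(U * V, S, T)  # stack of all 25 views for per-view processing
print(sai.shape, epi.shape, views.shape)
```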
Title: High-resolution enhanced cross-subspace fusion network for light field image superresolution (Displays, vol. 84, Article 102803)
Pub Date : 2024-07-26 | DOI: 10.1016/j.displa.2024.102804
Yong Wu , Jinyu Tian , HuiJun Liu , Yuanyan Tang
Dense video captioning automatically locates events in untrimmed videos and describes their contents in natural language. The task has many potential applications, including security, assisting people who are visually impaired, and video retrieval. Related datasets constitute an important foundation for research on data-driven methods. However, existing models for building dense video caption datasets were designed for the universal domain and often ignore the characteristics and requirements of a specific domain. In addition, a one-way dataset construction process cannot form a closed-loop iterative scheme to improve dataset quality. This paper therefore proposes a novel dataset construction model suited to classroom-specific scenarios, on the basis of which the Dense Video Caption Dataset of Student Classroom Behaviors (SCB-DVC) is constructed. Additionally, existing dense video captioning methods typically use only temporal event boundaries as direct supervisory information during localization and fail to consider semantic information, resulting in limited correlation between the localization and captioning stages. This defect makes it harder to locate events in videos with over-smooth boundaries, which arise when the foregrounds and backgrounds (temporal segments) of events are excessively similar. We therefore propose a dense video captioning method based on fine-grained semantic-aware assisted boundary localization. By introducing semantic-aware information, the method learns the discriminative features between the foreground and background of an event more effectively, providing stronger boundary perception and more accurate captions. Experimental results show that the proposed method performs well on both the SCB-DVC dataset and public datasets (ActivityNet Captions, YouCook2 and TACoS). We will release the SCB-DVC dataset soon.
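Event localization quality in dense video captioning is commonly scored by temporal IoU between predicted and ground-truth segments. A minimal sketch of that generic metric, not the paper's full evaluation protocol:

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 4 s overlap over an 8 s union -> 0.5
```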
Title: A dense video caption dataset of student classroom behaviors and a baseline model with boundary semantic awareness (Displays, vol. 84, Article 102804)

Pub Date : 2024-07-25 | DOI: 10.1016/j.displa.2024.102801
With the rapid advancement of artificial intelligence technology, particularly in adolescent education, new challenges and opportunities continually emerge. The current educational system increasingly requires automated detection and evaluation of teaching activities, offering fresh perspectives for enhancing the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and their limitations in specific applications constrain their widespread use in analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved some success in video–text retrieval tasks. However, these methods typically employ simple weighted fusion strategies and fail to suppress noise and information redundancy. Our study therefore proposes a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. It dynamically interacts with and integrates visual and audio features, enhancing visual features that may be missing or insufficient in specific teaching scenarios while reducing redundant information transfer during modality fusion. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF on video–text retrieval tasks; subsequent tests on teaching video datasets further prove the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields.
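Video–text retrieval of this kind is typically evaluated with Recall@K over a similarity matrix between video and caption embeddings. A generic sketch of the metric (not CLIP2TF's actual model; the similarity values are made up):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of video i to caption j; ground truth is the diagonal."""
    ranks = (-sim).argsort(axis=1)  # captions sorted best-first for each video
    return float(np.mean([i in ranks[i, :k] for i in range(sim.shape[0])]))

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.6, 0.5, 0.2]])
print(recall_at_k(sim, 1))  # videos 0 and 1 rank their own caption first -> 2/3
```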
Title: CLIP2TF: Multimodal video–text retrieval for adolescent education (Displays, vol. 84, Article 102801)
Pub Date : 2024-07-24 | DOI: 10.1016/j.displa.2024.102798
Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Mohammad Alamgir Hossain
Fine-grained image captioning is a focal point in the vision-to-language task and has attracted considerable attention for generating accurate and contextually relevant image captions. Effective attribute prediction and their utilization play a crucial role in enhancing image captioning performance. Despite progress in prior attribute-related methods, they either focus on predicting attributes related to the input image or concentrate on predicting linguistic context-related attributes at each time step in the language model. However, these approaches often overlook the importance of balancing visual and linguistic contexts, leading to ineffective exploitation of semantic information and a subsequent decline in performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. Following this, an Enhanced Attribute Predictor (EAP) is proposed, initially predicting linguistic context-related attributes and then using prior probabilities from the IAP module to rebalance image and linguistic context-related attributes, thereby generating more robust and enhanced attribute probabilities. These refined attributes are then integrated into the language LSTM layer to ensure accurate word prediction at each time step. The integration of the IAP and EAP modules in our proposed image captioning with the enhanced attribute predictor (ICEAP) model effectively incorporates high-level semantic details, enhancing overall model performance. The ICEAP outperforms contemporary models, yielding significant average improvements of 10.62% in CIDEr-D scores for MS-COCO, 9.63% for Flickr30K and 7.74% for Flickr8K datasets using cross-entropy optimization, with qualitative analysis confirming its ability to generate fine-grained captions.
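The rebalancing idea, mixing image-conditioned attribute priors (IAP) with step-wise linguistic attribute probabilities (EAP), can be caricatured as a convex combination over the attribute vocabulary. This is a loose illustration with an assumed mixing weight, not the paper's actual formula:

```python
import numpy as np

def rebalance(p_image, p_linguistic, alpha=0.5):
    """Blend two attribute distributions; alpha is an assumed mixing weight."""
    mixed = alpha * p_image + (1 - alpha) * p_linguistic
    return mixed / mixed.sum()  # renormalize over the attribute vocabulary

p_img = np.array([0.6, 0.3, 0.1])  # attributes suggested by the image
p_lng = np.array([0.2, 0.2, 0.6])  # attributes suggested by the words so far
print(rebalance(p_img, p_lng))     # -> 0.4, 0.25, 0.35
```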
Title: ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor (Displays, vol. 84, Article 102798)
Pub Date : 2024-07-22DOI: 10.1016/j.displa.2024.102797
Hongjian Wang, Suting Chen
In recent years, owing to wavelength-dependent light absorption and scattering, underwater photographs captured by imaging devices often exhibit blurriness, faded color tones, and low contrast. To address these challenges, convolutional neural networks (CNNs), with their robust feature-capturing capabilities and adaptable structures, have been employed for underwater image enhancement. However, most CNN-based studies on underwater image enhancement have not taken color-space kernel convolution adaptability into account, even though it can significantly enhance a model’s expressive capacity. Building upon current academic research on adjusting the color space size for each perceptual field, this paper introduces a Double-Branch Multi-Perception Kernel Adaptive (DBMKA) model. A DBMKA module is constructed from two perceptual branches that adapt the kernels: channel features and local image entropy. Additionally, considering the pronounced attenuation of the red channel in underwater images, a Dependency-Capturing Feature Jump Connection (DCFJC) module has been designed to capture the red channel’s dependence on the blue and green channels for compensation; its skip mechanism effectively preserves color contextual information. To better utilize the extracted features for enhancing underwater images, a Cross-Level Attention Feature Fusion (CLAFF) module has been designed. With the DBMKA model, the DCFJC module, and the CLAFF module, the network can effectively enhance various types of underwater images. Qualitative and quantitative evaluations were conducted on the UIEB and EUVP datasets. In the color correction comparison experiments, our method demonstrated a more uniform red channel distribution across all gray levels, maintaining color consistency and naturalness. 
Regarding image information entropy (IIE) and average gradient (AG), the data confirmed our method’s superiority in preserving image details. Furthermore, our proposed method showed performance improvements exceeding 10% on other metrics like MSE and UCIQE, further validating its effectiveness and accuracy.
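One of the DBMKA module's two perceptual branches adapts kernels from local image entropy. A minimal sketch of computing that cue on a grayscale image is shown below; the window size, bin count, and plain-Python sliding loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_entropy(img, win=3, bins=8):
    """Shannon entropy of the gray-level histogram in each win x win patch.

    Flat regions yield zero entropy; textured regions yield higher values,
    which is the kind of cue a kernel-adaptive branch could condition on.
    """
    img = np.asarray(img)
    h, w = img.shape
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")  # replicate borders
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            p = hist / hist.sum()
            p = p[p > 0]                     # drop empty bins (0*log 0 = 0)
            out[i, j] = -np.sum(p * np.log2(p))
    return out

flat = np.full((5, 5), 128, dtype=np.uint8)  # uniform patch
print(local_entropy(flat).max())             # → 0.0
```

A real network would compute such a map (or a learned proxy of it) in a vectorized or convolutional form and feed it to the kernel-adaptation branch alongside the channel-feature branch.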
{"title":"DBMKA-Net:Dual branch multi-perception kernel adaptation for underwater image enhancement","authors":"Hongjian Wang, Suting Chen","doi":"10.1016/j.displa.2024.102797","DOIUrl":"10.1016/j.displa.2024.102797","url":null,"abstract":"<div><p>In recent years, due to the dependence on wavelength-based light absorption and scattering, underwater photographs captured by devices often exhibit characteristics such as blurriness, faded color tones, and low contrast. To address these challenges, convolutional neural networks (CNNs) with their robust feature-capturing capabilities and adaptable structures have been employed for underwater image enhancement. However, most CNN-based studies on underwater image enhancement have not taken into account color space kernel convolution adaptability, which can significantly enhance the model’s expressive capacity. Building upon current academic research on adjusting the color space size for each perceptual field, this paper introduces a Double-Branch Multi-Perception Kernel Adaptive (DBMKA) model. A DBMKA module is constructed through two perceptual branches that adapt the kernels: channel features and local image entropy. Additionally, considering the pronounced attenuation of the red channel in underwater images, a Dependency-Capturing Feature Jump Connection module (DCFJC) has been designed to capture the red channel’s dependence on the blue and green channels for compensation. Its skip mechanism effectively preserves color contextual information. To better utilize the extracted features for enhancing underwater images, a Cross-Level Attention Feature Fusion (CLAFF) module has been designed. With the Double-Branch Multi-Perception Kernel Adaptive model, Dependency-Capturing Skip Connection module, and Cross-Level Adaptive Feature Fusion module, this network can effectively enhance various types of underwater images. Qualitative and quantitative evaluations were conducted on the UIEB and EUVP datasets. 
In the color correction comparison experiments, our method demonstrated a more uniform red channel distribution across all gray levels, maintaining color consistency and naturalness. Regarding image information entropy (IIE) and average gradient (AG), the data confirmed our method’s superiority in preserving image details. Furthermore, our proposed method showed performance improvements exceeding 10% on other metrics like MSE and UCIQE, further validating its effectiveness and accuracy.</p></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"84 ","pages":"Article 102797"},"PeriodicalIF":3.7,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141847192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-20DOI: 10.1016/j.displa.2024.102799
Jinge Shi , Yi Chen , Chaofan Wang , Ali Asghar Heidari , Lei Liu , Huiling Chen , Xiaowei Chen , Li Sun
Lupus Nephritis (LN) is considered the most prevalent manifestation of systemic lupus erythematosus. Medical imaging plays an important role in diagnosing and treating LN, helping doctors accurately assess the extent and severity of lesions. However, relying solely on visual observation and judgment can introduce subjectivity and errors, especially for complex pathological images. Image segmentation techniques are used to differentiate the various tissues and structures in medical images to assist doctors in diagnosis. Multi-threshold Image Segmentation (MIS) has gained widespread recognition for its direct and practical application, but existing MIS methods still have shortcomings. Therefore, this study combines non-local means, a 2D histogram, and 2D Renyi’s entropy to improve the performance of MIS methods. Additionally, this study introduces an improved variant of the Whale Optimization Algorithm (GTMWOA) to optimize the aforementioned MIS methods and reduce algorithm complexity. The GTMWOA fuses Gaussian Exploration (GE), Topology Mapping (TM), and Magnetic Liquid Climbing (MLC). The GE effectively amplifies the algorithm’s proficiency in local exploration and quickens the convergence rate. The TM helps the algorithm escape local optima, while the MLC mechanism emulates the physical phenomenon of magnetic liquid climbing, refining the algorithm’s convergence precision. This study conducted an extensive series of tests using the IEEE CEC 2017 benchmark functions to demonstrate the superior performance of GTMWOA in addressing intricate optimization problems. Furthermore, this study executed an experiment using Berkeley images and LN images to verify the superiority of GTMWOA in MIS. The ultimate outcomes of the MIS experiments substantiate the algorithm’s advanced capabilities and robustness in handling complex optimization problems.
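The MIS setup described above — an ordered set of gray-level thresholds partitioning pixels into classes, with a metaheuristic searching for the thresholds that maximize a separability objective — can be sketched as follows. The paper's 2D Renyi-entropy criterion is replaced here by simple between-class variance, and the GTMWOA by plain random search; both are illustrative stand-ins, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment(img, thresholds):
    """Assign each pixel a class label 0..len(thresholds) by thresholding."""
    return np.digitize(img, sorted(thresholds))

def between_class_variance(img, thresholds):
    """Weighted variance of class means around the global mean (Otsu-style)."""
    labels = segment(img, thresholds)
    mu = img.mean()
    score = 0.0
    for k in np.unique(labels):
        pix = img[labels == k]
        score += pix.size / img.size * (pix.mean() - mu) ** 2
    return score

# Random search over 200 candidate threshold pairs on a synthetic image;
# a metaheuristic like GTMWOA would explore this space far more efficiently.
img = rng.integers(0, 256, size=(64, 64))
candidates = [tuple(sorted(rng.choice(np.arange(1, 255), size=2, replace=False)))
              for _ in range(200)]
best = max(candidates, key=lambda t: between_class_variance(img, t))
print(best)
```

For pathological images the objective would be evaluated on the non-local-means-filtered 2D histogram rather than raw gray levels, per the study's pipeline.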
{"title":"Multi-threshold image segmentation using new strategies enhanced whale optimization for lupus nephritis pathological images","authors":"Jinge Shi , Yi Chen , Chaofan Wang , Ali Asghar Heidari , Lei Liu , Huiling Chen , Xiaowei Chen , Li Sun","doi":"10.1016/j.displa.2024.102799","DOIUrl":"10.1016/j.displa.2024.102799","url":null,"abstract":"<div><p>Lupus Nephritis (LN) has been considered as the most prevalent form of systemic lupus erythematosus. Medical imaging plays an important role in diagnosing and treating LN, which can help doctors accurately assess the extent and severity of the lesion. However, relying solely on visual observation and judgment can introduce subjectivity and errors, especially for complex pathological images. Image segmentation techniques are used to differentiate various tissues and structures in medical images to assist doctors in diagnosis. Multi-threshold Image Segmentation (MIS) has gained widespread recognition for its direct and practical application. However, existing MIS methods still have some issues. Therefore, this study combines non-local means, 2D histogram, and 2D Renyi’s entropy to improve the performance of MIS methods. Additionally, this study introduces an improved variant of the Whale Optimization Algorithm (GTMWOA) to optimize the aforementioned MIS methods and reduce algorithm complexity. The GTMWOA fuses Gaussian Exploration (GE), Topology Mapping (TM), and Magnetic Liquid Climbing (MLC). The GE effectively amplifies the algorithm’s proficiency in local exploration and quickens the convergence rate. The TM facilitates the algorithm in escaping local optima, while the MLC mechanism emulates the physical phenomenon of MLC, refining the algorithm’s convergence precision. This study conducted an extensive series of tests using the IEEE CEC 2017 benchmark functions to demonstrate the superior performance of GTMWOA in addressing intricate optimization problems. 
Furthermore, this study executed an experiment using Berkeley images and LN images to verify the superiority of GTMWOA in MIS. The ultimate outcomes of the MIS experiments substantiate the algorithm’s advanced capabilities and robustness in handling complex optimization problems.</p></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"84 ","pages":"Article 102799"},"PeriodicalIF":3.7,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141849492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}