
Displays: Latest Publications

Review on SLAM algorithms for Augmented Reality
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-31 | DOI: 10.1016/j.displa.2024.102806
Xingdong Sheng , Shijie Mao , Yichao Yan , Xiaokang Yang

Augmented Reality (AR) has gained significant attention in recent years as a technology that enhances the user’s perception of, and interaction with, the real world by overlaying virtual objects. Simultaneous Localization and Mapping (SLAM) algorithms play a crucial role in enabling AR applications by allowing a device to determine its position and orientation in the real world while mapping the environment. This paper first summarizes recent AR products and SLAM algorithms, and presents a comprehensive overview of SLAM approaches, including feature-based, direct, and deep learning-based methods, highlighting their advantages and limitations. It then provides an in-depth exploration of classical SLAM algorithms for AR, with a focus on visual SLAM and visual-inertial SLAM. Lastly, sensor configurations, datasets, and performance evaluation for AR SLAM are discussed. The review concludes with a summary of the current state of SLAM algorithms for AR and insights into future directions for research and development in this field. Overall, this review serves as a valuable resource for researchers and engineers interested in understanding the advancements and challenges in SLAM algorithms for AR.
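As a toy illustration of the localize-while-mapping loop the abstract describes, the planar (SE(2)) case can be sketched in a few lines. The odometry increments and landmark observations below are hypothetical and noiseless; this is not the pipeline of any surveyed system.

```python
import math

def compose(pose, delta):
    """Compose an SE(2) pose (x, y, theta) with a body-frame motion increment
    (the localization update of a dead-reckoning front end)."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            (th + dth) % (2 * math.pi))

def to_world(pose, obs):
    """Project a body-frame landmark observation into the world frame (mapping)."""
    x, y, th = pose
    ox, oy = obs
    return (x + ox * math.cos(th) - oy * math.sin(th),
            y + ox * math.sin(th) + oy * math.cos(th))

pose = (0.0, 0.0, 0.0)
landmarks = []
# Hypothetical, noiseless odometry increments and landmark observations.
for delta, obs in [((1.0, 0.0, math.pi / 2), (2.0, 0.0)),
                   ((1.0, 0.0, 0.0), (0.0, 1.0))]:
    pose = compose(pose, delta)            # localize
    landmarks.append(to_world(pose, obs))  # map
```

Real visual and visual-inertial SLAM systems replace the odometry increments with poses estimated from image features or IMU integration, but the alternation between pose update and map growth is the same.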

Displays, Volume 84, Article 102806.
Citations: 0
High-resolution enhanced cross-subspace fusion network for light field image superresolution
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-29 | DOI: 10.1016/j.displa.2024.102803
Shixu Ying , Shubo Zhou , Xue-Qin Jiang , Yongbin Gao , Feng Pan , Zhijun Fang

Light field (LF) images offer abundant spatial and angular information, and combining the two benefits the performance of LF image superresolution (LF image SR). Existing methods often decompose the 4D LF data into low-dimensional subspaces for individual feature extraction and fusion. However, their performance is restricted because they lack effective correlations between subspaces and miss crucial complementary information for capturing rich texture details. To address this, we propose a cross-subspace fusion network for LF spatial SR (CSFNet). Specifically, we design a progressive cross-subspace fusion module (PCSFM), which progressively establishes cross-subspace correlations based on a cross-attention mechanism to comprehensively enrich LF information. Additionally, we propose a high-resolution adaptive enhancement group (HR-AEG), which preserves texture and edge details in the high-resolution feature domain by employing a multibranch enhancement method and an adaptive weight strategy. The experimental results demonstrate that our approach achieves highly competitive performance on multiple LF datasets compared to state-of-the-art (SOTA) methods.
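The cross-attention mechanism that the PCSFM builds on can be sketched in plain Python: each query from one subspace softmax-weights the values of another. The tiny "spatial" and "angular" vectors below are hypothetical stand-ins for the paper's learned subspace features.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Each query vector (one subspace) attends over key/value vectors
    (another subspace): softmax(q.k / sqrt(d)) weights the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: one spatial-subspace query attending over
# two angular-subspace key/value pairs.
spatial_q = [[1.0, 0.0]]
angular_k = [[1.0, 0.0], [0.0, 1.0]]
angular_v = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(spatial_q, angular_k, angular_v)
```

The output blends both value vectors but leans toward the key most aligned with the query, which is how such a module lets one subspace selectively pull complementary information from another.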

Displays, Volume 84, Article 102803.
Citations: 0
A dense video caption dataset of student classroom behaviors and a baseline model with boundary semantic awareness
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-26 | DOI: 10.1016/j.displa.2024.102804
Yong Wu , Jinyu Tian , HuiJun Liu , Yuanyan Tang

Dense video captioning automatically locates events in untrimmed videos and describes their contents in natural language. This task has many potential applications, including security, assistance for people who are visually impaired, and video retrieval. The related datasets constitute an important foundation for research on data-driven methods. However, existing models for building dense video caption datasets were designed for the universal domain, often ignoring the characteristics and requirements of a specific domain. In addition, a one-way dataset construction process cannot form a closed-loop iterative scheme to improve dataset quality. Therefore, this paper proposes a novel dataset construction model suited to classroom-specific scenarios, and on this basis constructs the Dense Video Caption Dataset of Student Classroom Behaviors (SCB-DVC). Additionally, existing dense video captioning methods typically use only temporal event boundaries as direct supervisory information during localization and fail to consider semantic information, which limits the correlation between the localization and captioning stages. This defect makes it harder to locate events in videos with oversmooth boundaries, where the temporal foregrounds and backgrounds of events are excessively similar. Therefore, we propose a dense video captioning method based on fine-grained semantic-aware assisted boundary localization. By introducing semantic-aware information, it strengthens the learning of differential features between the foreground and background of an event, provides sharper boundary perception, and achieves more accurate captions. Experimental results show that the proposed method performs well on both the SCB-DVC dataset and public datasets (ActivityNet Captions, YouCook2 and TACoS). We will release the SCB-DVC dataset soon.

Displays, Volume 84, Article 102804.
Citations: 0
CLIP2TF: Multimodal video–text retrieval for adolescent education
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-25 | DOI: 10.1016/j.displa.2024.102801
Xiaoning Sun, Tao Fan, Hongxu Li, Guozhong Wang, Peien Ge, Xiwu Shang

With the rapid advancement of artificial intelligence technology, new challenges and opportunities continually emerge, particularly in the sphere of adolescent education. The current educational system increasingly requires automated detection and evaluation of teaching activities, offering fresh perspectives for enhancing the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and their limitations in specific applications constrain their widespread use in analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved some success in video–text retrieval tasks. However, these methods typically employ simple weighted fusion strategies and fail to avoid noise and information redundancy. Therefore, our study proposes a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. It dynamically interacts with and integrates visual and audio features, further enhancing visual features that may be missing or insufficient in specific teaching scenarios while effectively reducing redundant information transfer during modality fusion. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF in video–text retrieval tasks. Subsequent tests on teaching video datasets further prove the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields.

Displays, Volume 84, Article 102801.
Citations: 0
ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-24 | DOI: 10.1016/j.displa.2024.102798
Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Mohammad Alamgir Hossain

Fine-grained image captioning is a focal point in the vision-to-language task and has attracted considerable attention for generating accurate and contextually relevant image captions. Effective attribute prediction and utilization play a crucial role in enhancing captioning performance. Despite progress, prior attribute-related methods either focus on predicting attributes related to the input image or concentrate on predicting linguistic context-related attributes at each time step in the language model. However, these approaches often overlook the importance of balancing visual and linguistic contexts, leading to ineffective exploitation of semantic information and a subsequent decline in performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. Following this, an Enhanced Attribute Predictor (EAP) is proposed, which initially predicts linguistic context-related attributes and then uses prior probabilities from the IAP module to rebalance image- and linguistic-context-related attributes, thereby generating more robust attribute probabilities. These refined attributes are then integrated into the language LSTM layer to ensure accurate word prediction at each time step. Integrating the IAP and EAP modules in our proposed image captioning with enhanced attribute predictor (ICEAP) model effectively incorporates high-level semantic details, enhancing overall model performance. ICEAP outperforms contemporary models, yielding significant average improvements in CIDEr-D scores of 10.62% on MS-COCO, 9.63% on Flickr30K, and 7.74% on Flickr8K under cross-entropy optimization, with qualitative analysis confirming its ability to generate fine-grained captions.
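One plausible reading of the EAP rebalancing step (reweighting language-context attribute probabilities by the IAP's image-conditioned priors, then renormalizing) can be sketched as follows; the attribute vocabulary and all probability values are invented for illustration and are not the paper's exact formulation.

```python
def rebalance(lang_probs, image_priors):
    """Reweight language-context attribute probabilities by image-conditioned
    priors, then renormalize. One plausible reading of the IAP -> EAP
    rebalancing step, not the paper's learned formulation."""
    mixed = [p * q for p, q in zip(lang_probs, image_priors)]
    z = sum(mixed)
    return [m / z for m in mixed]

# Hypothetical attribute vocabulary: ["furry", "red", "wooden"]
lang = [0.5, 0.3, 0.2]    # attributes suggested by the linguistic context
prior = [0.7, 0.1, 0.2]   # attributes predicted from the image (IAP priors)
balanced = rebalance(lang, prior)
print(balanced)
```

Note how "red", favored by the language context but unsupported by the image prior, is demoted below "wooden" after rebalancing; this is the kind of visual-linguistic balancing the abstract argues prior methods lack.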

Displays, Volume 84, Article 102798.
Citations: 0
DBMKA-Net: Dual branch multi-perception kernel adaptation for underwater image enhancement
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-22 | DOI: 10.1016/j.displa.2024.102797
Hongjian Wang, Suting Chen

In recent years, because light absorption and scattering underwater are wavelength-dependent, photographs captured by underwater devices often exhibit blurriness, faded color tones, and low contrast. To address these challenges, convolutional neural networks (CNNs), with their robust feature-capturing capabilities and adaptable structures, have been employed for underwater image enhancement. However, most CNN-based studies on underwater image enhancement have not taken into account the adaptability of kernel convolution to the color space, which can significantly enhance a model’s expressive capacity. Building upon current research on adjusting the color space size for each perceptual field, this paper introduces a Double-Branch Multi-Perception Kernel Adaptive (DBMKA) model. The DBMKA module is constructed from two perceptual branches that adapt the kernels according to channel features and local image entropy. Additionally, considering the pronounced attenuation of the red channel in underwater images, a Dependency-Capturing Feature Jump Connection module (DCFJC) is designed to capture the red channel’s dependence on the blue and green channels for compensation; its skip mechanism effectively preserves color contextual information. To better utilize the extracted features, a Cross-Level Attention Feature Fusion (CLAFF) module is designed. With these three modules, the network can effectively enhance various types of underwater images. Qualitative and quantitative evaluations were conducted on the UIEB and EUVP datasets. In the color correction comparison experiments, our method produced a more uniform red channel distribution across all gray levels, maintaining color consistency and naturalness. Regarding image information entropy (IIE) and average gradient (AG), the data confirmed our method’s superiority in preserving image details. Furthermore, the proposed method showed performance improvements exceeding 10% on other metrics such as MSE and UCIQE, further validating its effectiveness and accuracy.
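The red channel's dependence on the less-attenuated green channel can be illustrated with the classic hand-crafted (Ancuti-style) compensation formula. This is a stand-in for intuition only, not the learned DCFJC module, and the normalized pixel values below are hypothetical.

```python
def compensate_red(r, g, alpha=1.0):
    """Classic red-channel compensation for underwater images: borrow signal
    from the less-attenuated green channel where red is weak. A hand-crafted
    illustration of the dependence the DCFJC module learns to capture."""
    mr = sum(r) / len(r)   # mean of the attenuated red channel
    mg = sum(g) / len(g)   # mean of the better-preserved green channel
    # Add a correction proportional to the channel-mean gap, scaled so that
    # already-bright red pixels (1 - ri) receive less compensation.
    return [ri + alpha * (mg - mr) * (1.0 - ri) * gi
            for ri, gi in zip(r, g)]

# Hypothetical normalized pixel rows: red strongly attenuated, green less so.
red = [0.1, 0.2, 0.1]
green = [0.6, 0.7, 0.5]
compensated = compensate_red(red, green)
print(compensated)
```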

Displays, Volume 84, Article 102797.
Citations: 0
Multi-threshold image segmentation using new strategies enhanced whale optimization for lupus nephritis pathological images
IF 3.7 | CAS Tier 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-07-20 | DOI: 10.1016/j.displa.2024.102799
Jinge Shi , Yi Chen , Chaofan Wang , Ali Asghar Heidari , Lei Liu , Huiling Chen , Xiaowei Chen , Li Sun

Lupus Nephritis (LN) has been considered the most prevalent form of systemic lupus erythematosus. Medical imaging plays an important role in diagnosing and treating LN, helping doctors accurately assess the extent and severity of lesions. However, relying solely on visual observation and judgment can introduce subjectivity and errors, especially for complex pathological images. Image segmentation techniques are used to differentiate the various tissues and structures in medical images to assist doctors in diagnosis. Multi-threshold Image Segmentation (MIS) has gained widespread recognition for its direct and practical application, but existing MIS methods still have some issues. Therefore, this study combines non-local means, a 2D histogram, and 2D Renyi entropy to improve the performance of MIS. Additionally, it introduces an improved variant of the Whale Optimization Algorithm (GTMWOA) to optimize the aforementioned MIS method and reduce algorithm complexity. GTMWOA fuses Gaussian Exploration (GE), Topology Mapping (TM), and Magnetic Liquid Climbing (MLC). GE effectively amplifies the algorithm’s proficiency in local exploration and quickens the convergence rate. TM facilitates the algorithm’s escape from local optima, while the MLC mechanism emulates the physical phenomenon of magnetic liquid climbing, refining the algorithm’s convergence precision. This study conducted an extensive series of tests on the IEEE CEC 2017 benchmark functions to demonstrate the superior performance of GTMWOA in addressing intricate optimization problems. Furthermore, an experiment using Berkeley images and LN images verifies the superiority of GTMWOA in MIS. The ultimate outcomes of the MIS experiments substantiate the algorithm’s advanced capabilities and robustness in handling complex optimization problems.
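For intuition on what a multi-threshold segmentation objective optimizes, here is an exhaustive search that maximizes Kapur's (Shannon) entropy criterion on a toy 1D histogram. The paper instead maximizes a 2D Renyi variant and uses GTMWOA in place of brute force, which matters once the bin count and threshold count grow; the histogram below is invented.

```python
import math
from itertools import combinations

def kapur_entropy(hist, thresholds):
    """Sum of Shannon entropies of the gray-level classes induced by the
    thresholds (Kapur's criterion); bin t is the last bin of its class."""
    total = sum(hist)
    bounds = [0] + [t + 1 for t in sorted(thresholds)] + [len(hist)]
    h = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        w = sum(hist[lo:hi]) / total      # class probability mass
        if w == 0:
            continue
        for c in hist[lo:hi]:
            if c:
                p = c / total / w         # within-class probability
                h -= p * math.log(p)
    return h

def best_thresholds(hist, k=2):
    # Brute-force search over all threshold tuples; this is the role the
    # (GTM)WOA metaheuristic plays in the paper, at far lower cost.
    return max(combinations(range(len(hist) - 1), k),
               key=lambda t: kapur_entropy(hist, t))

hist = [30, 25, 1, 1, 40, 35, 1, 20]  # hypothetical 8-bin histogram, three modes
thresholds = best_thresholds(hist)
print(thresholds)
```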

Displays 84 (2024), Article 102799.
A unified architecture for super-resolution and segmentation of remote sensing images based on similarity feature fusion
IF 3.7 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE (CAS Zone 2, Engineering Technology) Pub Date: 2024-07-20 DOI: 10.1016/j.displa.2024.102800
Lunqian Wang, Xinghua Wang, Weilin Liu, Hao Ding, Bo Xia, Zekai Zhang, Jinglin Zhang, Sen Xu

The resolution of an image has an important impact on segmentation accuracy. Integrating super-resolution (SR) techniques into the semantic segmentation of remote sensing images improves precision and accuracy, especially when the images are blurred. In this paper, a novel and efficient SR semantic segmentation network (SRSEN) is designed by exploiting the similarity between the SR and segmentation tasks in feature processing. SRSEN consists of a multi-scale feature encoder, an SR fusion decoder, and a multi-path feature refinement block, which adaptively establishes feature associations between the segmentation and SR tasks to improve the segmentation accuracy of blurred images. Experiments show that the proposed method achieves higher segmentation accuracy on blurry images than state-of-the-art models. Specifically, the mIoU of the proposed SRSEN is 3%–6% higher than that of other state-of-the-art models on the low-resolution LoveDa, Vaihingen, and Potsdam datasets.
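The "similarity feature fusion" idea can be caricatured in a few lines: gate the SR branch's features into the segmentation branch by per-position cosine similarity over channels. This is a hypothetical sketch, not the SRSEN module itself; the function name `similarity_fuse` and the particular gating form are invented for illustration.

```python
import numpy as np

def similarity_fuse(f_seg, f_sr, eps=1e-8):
    # f_seg, f_sr: (C, H, W) features from the segmentation and SR paths.
    # Channel-wise cosine similarity at each spatial position, mapped from
    # [-1, 1] to [0, 1], decides how strongly the SR feature is blended in.
    num = (f_seg * f_sr).sum(axis=0)
    den = np.linalg.norm(f_seg, axis=0) * np.linalg.norm(f_sr, axis=0) + eps
    w = 0.5 * (num / den + 1.0)
    return f_seg + w[None] * f_sr

rng = np.random.default_rng(1)
f_seg = rng.normal(size=(8, 4, 4))
f_sr = rng.normal(size=(8, 4, 4))
print(similarity_fuse(f_seg, f_sr).shape)  # (8, 4, 4)
```

The gate makes the fusion adaptive: positions where the two branches agree borrow heavily from the SR path, while dissimilar positions leave the segmentation feature nearly untouched.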

Displays 84 (2024), Article 102800.
BiF-DETR: Remote sensing object detection based on Bidirectional information fusion
IF 3.7 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE (CAS Zone 2, Engineering Technology) Pub Date: 2024-07-19 DOI: 10.1016/j.displa.2024.102802
Zhijing Xu, Chao Wang, Kan Huang

Remote Sensing Object Detection (RSOD) is a fundamental task in remote sensing image processing. The complexity of backgrounds, the diversity of object scales, and the locality limitation of Convolutional Neural Networks (CNNs) pose specific challenges for RSOD. In this paper, an innovative hybrid detector, the Bidirectional Information Fusion DEtection TRansformer (BiF-DETR), is proposed to mitigate these issues. Specifically, BiF-DETR takes the anchor-free detection network CenterNet as its baseline, designs the feature extraction backbone in parallel, extracts local feature details using CNNs, and obtains global information and long-range dependencies using a Transformer branch. A Bidirectional Information Fusion (BIF) module is carefully designed to reduce the semantic differences between different styles of feature maps through multi-level iterative information interactions, fully exploiting the complementary advantages of the different detectors. Additionally, Coordination Attention (CA) is introduced to enable the detection network to focus on the saliency information of small objects. To address the insufficient diversity of remote sensing images in the training stage, Cascade Mixture Data Augmentation (CMDA) is designed to improve the robustness and generalization ability of the model. Comparative experiments with other cutting-edge methods are conducted on the publicly available DOTA and NWPU VHR-10 datasets. The experimental results reveal that the proposed method achieves state-of-the-art performance, with mAP reaching 77.43% and 94.75%, respectively, far exceeding the other 25 competitive methods.
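Coordinate-attention-style mechanisms like the CA mentioned above factorize global pooling into per-row and per-column pooling so the resulting gates retain one spatial coordinate each. A stripped-down numpy sketch of just that pooling/gating step (the shared 1x1 convolutions, batch normalization, and nonlinearity used in published coordinate-attention modules are omitted, and `coord_gate` is an illustrative name, not the paper's module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coord_gate(x):
    # x: (C, H, W). Pool along width and along height separately, so the
    # resulting gates keep one spatial coordinate each; their product
    # re-weights the feature map without discarding position entirely.
    g_h = sigmoid(x.mean(axis=2, keepdims=True))  # (C, H, 1): row profile
    g_w = sigmoid(x.mean(axis=1, keepdims=True))  # (C, 1, W): column profile
    return x * g_h * g_w

rng = np.random.default_rng(2)
feat = rng.normal(size=(4, 6, 5))
out = coord_gate(feat)
print(out.shape)  # (4, 6, 5)
```

Because each gate varies along one axis, a small object's row and column can both be emphasized, which a single global-average gate cannot express.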

Displays 84 (2024), Article 102802 (open access).
FSNet: A dual-domain network for few-shot image classification
IF 3.7 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE (CAS Zone 2, Engineering Technology) Pub Date: 2024-07-14 DOI: 10.1016/j.displa.2024.102795
Xuewen Yan, Zhangjin Huang

Few-shot learning is a challenging task that aims to learn and identify novel classes from a limited number of unseen labeled samples. Previous work has focused primarily on extracting features solely in the spatial domain of images. However, the compressed representation in the frequency domain, which contains rich pattern information, is a powerful tool in signal processing. Combining the frequency and spatial domains to obtain richer information can effectively alleviate the overfitting problem. In this paper, we propose a dual-domain model called Frequency Space Net (FSNet), which preprocesses input images simultaneously in the spatial and frequency domains, extracts spatial and frequency information through two feature extractors, and fuses them into a composite feature for image classification. We start from a different view of frequency analysis, linking conventional average pooling to the Discrete Cosine Transformation (DCT), and generalize the compression step of the attention mechanism to the frequency domain. Consequently, we propose a novel Frequency Channel Spatial (FCS) attention mechanism. Extensive experiments demonstrate that frequency and spatial information are complementary in few-shot image classification, improving model performance. Our method outperforms state-of-the-art approaches on miniImageNet and CUB.
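The link between average pooling and the DCT that the abstract mentions is concrete: global average pooling is, up to a constant, the (0, 0) coefficient of the 2D DCT-II, so frequency-channel attention generalizes squeeze-style pooling by sampling more coefficients. A small numpy check of that identity (the orthonormal DCT-II normalization is assumed; `dct2_coeff` is an illustrative helper, not FSNet code):

```python
import numpy as np

def dct2_coeff(x, u, v):
    # One coefficient of the orthonormal 2D DCT-II of an (H, W) map.
    h, w = x.shape
    cu = np.sqrt(1.0 / h) if u == 0 else np.sqrt(2.0 / h)
    cv = np.sqrt(1.0 / w) if v == 0 else np.sqrt(2.0 / w)
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    basis = (np.cos((2 * i + 1) * u * np.pi / (2 * h))
             * np.cos((2 * j + 1) * v * np.pi / (2 * w)))
    return cu * cv * float((x * basis).sum())

rng = np.random.default_rng(3)
feat = rng.normal(size=(7, 5))
dc = dct2_coeff(feat, 0, 0)
# Global average pooling equals the DC coefficient up to sqrt(H * W):
print(np.isclose(dc, np.sqrt(7 * 5) * feat.mean()))  # True
```

At (u, v) = (0, 0) the cosine basis is constant, so the coefficient collapses to a scaled sum of the map — exactly what global average pooling computes.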

Displays 84 (2024), Article 102795.