Multiscale Transformer Hierarchically Embedded CNN Hybrid Network for Visible-Infrared Person Reidentification

IF 8.9 | CAS Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS | IEEE Internet of Things Journal | Pub Date: 2024-11-20 | DOI: 10.1109/JIOT.2024.3503766

Suixin Liang, Jian Lu, Kaibing Zhang, Xiaogai Chen

IEEE Internet of Things Journal, vol. 12, no. 7, pp. 9004-9018. Journal article. https://ieeexplore.ieee.org/document/10759671/

Citations: 0

Abstract

Visible-infrared person reidentification (VI-ReID) is a pivotal technology for intelligent security surveillance systems in the Internet of Things (IoT). A key challenge in VI-ReID is extracting and fusing robust global and local pedestrian information to mitigate the intermodality discrepancy. Although convolutional neural network (CNN)-based methods have achieved significant success, their local receptive fields and downsampling processes limit the extraction of global pedestrian information, which makes cross-modality information fusion difficult. Existing pure Transformer-based methods excel at capturing global pedestrian information, but their core self-attention mechanism uses uniform-sized queries, keys, and values. As a result, they acquire only uniform-scale information, which limits the learning of multiscale information and prevents the full extraction of local pedestrian information. To address these issues, we propose a multiscale Transformer hierarchically embedded CNN hybrid network (MTECN). MTECN extracts local and global pedestrian information simultaneously at different scales, mitigating the adverse impact on recognition caused by the discrepancy between features extracted from different modalities. Moreover, a spatial consistency (SC) loss alleviates the effects of inherent factors such as camera viewpoint and illumination variations: it guides the network in exploring and discriminating the spatial structures of pedestrians across modalities, thereby aligning the underlying spatial semantic information. Furthermore, in the low-light VI-ReID task, the LLCM dataset suffers from insufficient information because of low-light conditions. We therefore employ a low-light enhancement (LLE) module to restore obscured detail in low-light images, further enhancing MTECN's robust feature learning in complex backgrounds. To the best of our knowledge, this is the first work to use Transformer hierarchically embedded CNN networks for VI-ReID, and the first to use LLE techniques for the low-light VI-ReID task. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that the proposed MTECN outperforms several state-of-the-art methods.
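The contrast the abstract draws between uniform-scale and multiscale self-attention can be illustrated with a small sketch. This is not the MTECN architecture, whose details the abstract does not give. It is a generic illustration under stated assumptions: queries stay at full resolution, while keys and values are average-pooled over token windows of several sizes, and the per-scale outputs are averaged. The function names (`pool_tokens`, `multiscale_attention`), the choice of scales, and the omission of learned projections are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_tokens(tokens, scale):
    """Average-pool an (N, D) token sequence over non-overlapping windows of `scale` tokens.

    Assumes N >= scale; a ragged tail shorter than `scale` is dropped for simplicity.
    """
    n, d = tokens.shape
    usable = (n // scale) * scale
    return tokens[:usable].reshape(n // scale, scale, d).mean(axis=1)


def multiscale_attention(tokens, scales=(1, 2, 4)):
    """Illustrative multiscale attention (hypothetical, no learned projections).

    Full-resolution queries attend to keys/values pooled at each scale:
    scale 1 is ordinary uniform-scale self-attention, larger scales summarize
    coarser regions. The per-scale outputs are averaged.
    """
    n, d = tokens.shape
    outputs = []
    for s in scales:
        kv = pool_tokens(tokens, s)              # (n // s, d) coarse keys/values
        scores = tokens @ kv.T / np.sqrt(d)      # (n, n // s) scaled dot products
        weights = softmax(scores, axis=-1)       # each query's weights sum to 1
        outputs.append(weights @ kv)             # (n, d) output for this scale
    return np.mean(outputs, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 16))            # 8 tokens, 16 channels
    print(multiscale_attention(tokens).shape)
```

With `scales=(1,)` the function reduces to plain single-scale self-attention, which is the limitation the abstract attributes to pure Transformer-based methods. Adding coarser scales lets each query mix fine-grained and region-level context.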

Source journal

IEEE Internet of Things Journal Computer Science-Information Systems

CiteScore: 17.60

Self-citation rate: 13.20%

Articles published: 1982

Journal description: The IEEE Internet of Things (IoT) Journal publishes articles and review articles covering various aspects of IoT, including IoT system architecture, IoT enabling technologies, IoT communication and networking protocols such as network coding, and IoT services and applications. Topics encompass IoT's impacts on sensor technologies, big data management, and future Internet design for applications like smart cities and smart homes. Fields of interest include IoT architecture, such as things-centric, data-centric, and service-oriented IoT architecture; IoT enabling technologies and systematic integration, such as sensor technologies, big sensor data management, and future Internet design for IoT; IoT services, applications, and test-beds, such as IoT service middleware, IoT application programming interfaces (APIs), IoT application design, and IoT trials and experiments; and IoT standardization activities and technology development in different standards development organizations (SDOs) such as IEEE, IETF, ITU, 3GPP, and ETSI.