Multiscale Transformer Hierarchically Embedded CNN Hybrid Network for Visible-Infrared Person Reidentification

IF 8.9 · Q1, Computer Science, Information Systems · IEEE Internet of Things Journal · Pub Date: 2024-11-20 · DOI: 10.1109/JIOT.2024.3503766
Suixin Liang;Jian Lu;Kaibing Zhang;Xiaogai Chen
Volume 12, Issue 7, pp. 9004-9018 · https://ieeexplore.ieee.org/document/10759671/
Citations: 0

Abstract

Visible-infrared person reidentification (VI-ReID) is a pivotal technology for intelligent security surveillance systems in the Internet of Things (IoT). A key challenge in VI-ReID is extracting and fusing robust global and local pedestrian information to mitigate the intermodality discrepancy. Despite the significant success of convolutional neural network (CNN)-based methods, their inherent properties, namely local receptive fields and downsampling, limit the extraction of global pedestrian information and make cross-modality information fusion difficult. Existing pure Transformer-based methods excel at capturing global pedestrian information, but their core self-attention mechanism employs uniform-sized queries, keys, and values. As a result, only uniform-scale information is acquired, which limits the learning of multiscale information and prevents the full extraction of local pedestrian information. To address these issues, we propose a multiscale Transformer hierarchically embedded CNN hybrid network (MTECN). MTECN simultaneously extracts local and global pedestrian information at different scales to mitigate the adverse impact on recognition caused by the discrepancy in features extracted across modalities. Moreover, the effects of inherent factors, including camera viewpoint and illumination variations, are alleviated by incorporating a spatial consistency (SC) loss, which guides the network in exploring and discriminating the spatial structures of pedestrians across modalities, thereby aligning the underlying spatial semantic information. Furthermore, in the low-light VI-ReID task, the LLCM dataset suffers from information insufficiency due to low-light conditions. Consequently, a low-light enhancement (LLE) module is employed to restore obscured detail in low-light images, further strengthening MTECN's robust feature learning in complex backgrounds. To the best of our knowledge, this is the first work to use Transformer hierarchically embedded CNN networks for VI-ReID research, and the first to apply LLE techniques to the low-light VI-ReID task. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that the proposed MTECN outperforms several state-of-the-art methods.
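The multiscale attention idea the abstract contrasts with standard self-attention can be illustrated with a small sketch: keep queries at full token resolution, but pool the keys and values to several coarser scales before attending. This is not the authors' MTECN implementation; the average-pooling strategy, the scale set `(1, 2, 4)`, and fusion by simple averaging are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(x, stride):
    """Average consecutive groups of `stride` tokens. x: (n, d)."""
    n, d = x.shape
    n_out = n // stride
    return x[: n_out * stride].reshape(n_out, stride, d).mean(axis=1)

def multiscale_attention(x, scales=(1, 2, 4)):
    """Toy multiscale self-attention over token features x: (n, d).

    Queries stay at full resolution; keys/values are pooled at each
    scale, so each attention map mixes information at a different
    granularity. Outputs across scales are averaged (an assumption;
    a learned fusion would be used in practice).
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for s in scales:
        kv = avg_pool_tokens(x, s)                    # (n // s, d)
        attn = softmax(x @ kv.T / np.sqrt(d))         # (n, n // s)
        out += attn @ kv
    return out / len(scales)
```

With `scales=(1,)` this reduces to ordinary single-scale self-attention (shared key/value projection omitted for brevity), which is exactly the uniform-scale limitation the abstract attributes to pure Transformer methods.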
Source Journal

IEEE Internet of Things Journal (Computer Science, Information Systems)
CiteScore: 17.60 · Self-citation rate: 13.20% · Annual publications: 1982
Journal description: The IEEE Internet of Things (IoT) Journal publishes articles and review articles covering various aspects of IoT, including IoT system architecture, IoT enabling technologies, IoT communication and networking protocols such as network coding, and IoT services and applications. Topics encompass IoT's impacts on sensor technologies, big data management, and future internet design for applications like smart cities and smart homes. Fields of interest include IoT architecture, such as things-centric, data-centric, and service-oriented IoT architecture; IoT enabling technologies and systematic integration, such as sensor technologies, big sensor data management, and future Internet design for IoT; IoT services, applications, and test-beds, such as IoT service middleware, IoT application programming interfaces (APIs), IoT application design, and IoT trials/experiments; and IoT standardization activities and technology development in standards development organizations (SDOs) such as IEEE, IETF, ITU, 3GPP, and ETSI.