Heterogeneous Dual-Attentional Network for WiFi and Video-Fused Multi-Modal Crowd Counting

IF 7.7; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS); IEEE Transactions on Mobile Computing; Pub Date: 2024-08-15; DOI: 10.1109/TMC.2024.3444469
Lifei Hao;Baoqi Huang;Bing Jia;Guoqiang Mao
IEEE Transactions on Mobile Computing, vol. 23, no. 12, pp. 14233-14247, Journal Article. https://ieeexplore.ieee.org/document/10637758/
Citations: 0

Abstract

Crowd counting aims to estimate the number of individuals in a target area. Mainstream vision-based methods suffer from limited coverage and difficult multi-camera collaboration, which limits their scalability, whereas emerging WiFi-based methods yield only coarse results because of signal randomness. To overcome the inherent limitations of unimodal approaches and exploit the advantages of multi-modal ones, this paper presents a WiFi and video-fused multi-modal paradigm built on a heterogeneous dual-attentional network, which jointly models the intra- and inter-modality relationships of global WiFi measurements and local videos to achieve accurate and stable large-scale crowd counting. First, a flexible hybrid sensing network is constructed to capture synchronized multi-modal measurements that characterize the same crowd at different scales and from different perspectives. Second, differential preprocessing, heterogeneous feature extractors, and self-attention mechanisms are applied in sequence to extract and refine modality-independent, crowd-related features. Third, a cross-attention mechanism is employed to deeply fuse the two modalities and generalize their matching relationships. Extensive real-world experiments demonstrate that, compared with the best WiFi unimodal baseline, our method reduces the error by 26.2%, improves stability by 48.43%, and achieves an accuracy of about 88% in large-scale crowd counting when videos from two cameras are included.
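The abstract describes self-attention within each modality followed by cross-attention between modalities. The sketch below illustrates only that general pattern with plain scaled dot-product attention; it is not the authors' network. The function names (`attention`, `dual_attention_fuse`), the token shapes, the choice to let WiFi tokens query video tokens, and the omission of learned projection weights are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


def dual_attention_fuse(wifi_tokens, video_tokens):
    """Hypothetical fusion step (assumption, not the paper's exact design)."""
    # Intra-modality: each modality attends to itself.
    wifi_self, _ = attention(wifi_tokens, wifi_tokens, wifi_tokens)
    video_self, _ = attention(video_tokens, video_tokens, video_tokens)
    # Inter-modality: WiFi tokens act as queries over video tokens.
    # The query direction is an assumption; learned projections are omitted.
    fused, cross_weights = attention(wifi_self, video_self, video_self)
    return fused, cross_weights


random.seed(0)
# Toy inputs: 3 WiFi feature tokens and 5 video feature tokens, each of width 4.
wifi = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
video = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
fused, cross_weights = dual_attention_fuse(wifi, video)
print(len(fused), len(fused[0]))
```

In this toy setup the fused output has one row per WiFi token, and each row of `cross_weights` is a probability distribution over the video tokens. A real model would add learned query, key, and value projections and a regression head that maps the fused features to a crowd count.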
Source journal
IEEE Transactions on Mobile Computing (Engineering and Technology: Telecommunications)
CiteScore: 12.90
Self-citation rate: 2.50%
Articles published: 403
Review time: 6.6 months
Journal description: IEEE Transactions on Mobile Computing addresses key technical issues related to various aspects of mobile computing. This includes (a) architectures, (b) support services, (c) algorithm/protocol design and analysis, (d) mobile environments, (e) mobile communication systems, (f) applications, and (g) emerging technologies. Topics of interest span a wide range, covering mobile networks and hosts, mobility management, multimedia, operating system support, power management, online and mobile environments, security, scalability, reliability, and emerging technologies such as wearable computers, body area networks, and wireless sensor networks. The journal serves as a comprehensive platform for advancements in mobile computing research.
Latest articles from this journal
Efficient Coordination of Federated Learning and Inference Offloading at the Edge: A Proactive Optimization Paradigm
Multi-User Task Offloading in UAV-Assisted LEO Satellite Edge Computing: A Game-Theoretic Approach
Model Decomposition and Reassembly for Purified Knowledge Transfer in Personalized Federated Learning
FedCRAC: Improving Federated Classification Performance on Long-Tailed Data via Classifier Representation Adjustment and Calibration
Scrava: Super Resolution-Based Bandwidth-Efficient Cross-Camera Video Analytics