LCFFNet：一种用于人体姿态估计的轻量级跨尺度特征融合网络。

IF 6.3 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Neural Networks Pub Date : 2025-03-01 Epub Date: 2024-12-04 DOI:10.1016/j.neunet.2024.106959

Xuelian Zou , Xiaojun Bi

{"title":"LCFFNet：一种用于人体姿态估计的轻量级跨尺度特征融合网络。","authors":"Xuelian Zou , Xiaojun Bi","doi":"10.1016/j.neunet.2024.106959","DOIUrl":null,"url":null,"abstract":"<div><div>Human pose estimation is one of the most critical and challenging problems in computer vision. It is applied in many computer vision fields and has important research significance. However, it is still a difficult challenge to strike a balance between the number of parameters and computing load of the model and the accuracy of human pose estimation. In this study, we suggest a Lightweight Cross-scale Feature Fusion Network (LCFFNet) to strike a balance between accuracy and computational load and parameter volume. The Lightweight HRNet-Like (LHRNet) network, Cross-Resolution-Aware Semantics Module (CRASM), and Adapt Feature Fusion Module (AFFM) make up LCFFNet. To be more precise, first, we suggest a lightweight LHRNet network that includes Dynamic Multi-scale Convolution Basic (DMSC-Basic block) block, Basic block, and DMSC-Basic block submodules in the network’s three high-resolution subnetwork stages. The proposed dynamic multi-scale convolution in DMSC-Basic block can reduces the amount of model parameters and complexity of the LHRNet network, and has the ability to extract variable pose features. In order to maintain the model’s ability to express features, the Basic block is introduced. As a result, the LHRNet network not only makes the model more lightweight but also enhances its feature expression capabilities. Second, we propose a CRASM module to enhance contextual semantic information while reducing the semantic gap between different scales by fusing features from different scales. Finally, the augmented semantic feature map’s spatial resolution is finally restored from bottom to top using our suggested AFFM, and adaptive feature fusion is used to increase the positioning accuracy of important sites. Our method successfully predicts keypoints with 74.2 % AP, 89.9 % [email protected] and 66.9 % AP on the MSCOCO 2017, MPII and Crowdpose datasets, respectively. Our model reduces the number of parameters by 89.0 % and the computational complexity by 87.5 % compared with HRNet. The proposed network performs as well as current large-model human pose estimation networks while outperforming state-of the-art lightweight networks.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"183 ","pages":"Article 106959"},"PeriodicalIF":6.3000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LCFFNet: A Lightweight Cross-scale Feature Fusion Network for human pose estimation\",\"authors\":\"Xuelian Zou , Xiaojun Bi\",\"doi\":\"10.1016/j.neunet.2024.106959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Human pose estimation is one of the most critical and challenging problems in computer vision. It is applied in many computer vision fields and has important research significance. However, it is still a difficult challenge to strike a balance between the number of parameters and computing load of the model and the accuracy of human pose estimation. In this study, we suggest a Lightweight Cross-scale Feature Fusion Network (LCFFNet) to strike a balance between accuracy and computational load and parameter volume. The Lightweight HRNet-Like (LHRNet) network, Cross-Resolution-Aware Semantics Module (CRASM), and Adapt Feature Fusion Module (AFFM) make up LCFFNet. To be more precise, first, we suggest a lightweight LHRNet network that includes Dynamic Multi-scale Convolution Basic (DMSC-Basic block) block, Basic block, and DMSC-Basic block submodules in the network’s three high-resolution subnetwork stages. The proposed dynamic multi-scale convolution in DMSC-Basic block can reduces the amount of model parameters and complexity of the LHRNet network, and has the ability to extract variable pose features. In order to maintain the model’s ability to express features, the Basic block is introduced. As a result, the LHRNet network not only makes the model more lightweight but also enhances its feature expression capabilities. Second, we propose a CRASM module to enhance contextual semantic information while reducing the semantic gap between different scales by fusing features from different scales. Finally, the augmented semantic feature map’s spatial resolution is finally restored from bottom to top using our suggested AFFM, and adaptive feature fusion is used to increase the positioning accuracy of important sites. Our method successfully predicts keypoints with 74.2 % AP, 89.9 % [email protected] and 66.9 % AP on the MSCOCO 2017, MPII and Crowdpose datasets, respectively. Our model reduces the number of parameters by 89.0 % and the computational complexity by 87.5 % compared with HRNet. The proposed network performs as well as current large-model human pose estimation networks while outperforming state-of the-art lightweight networks.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"183 \",\"pages\":\"Article 106959\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608024008888\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/12/4 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024008888","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/4 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

人体姿态估计是计算机视觉中最关键、最具挑战性的问题之一。它被应用于许多计算机视觉领域，具有重要的研究意义。然而，如何在模型的参数数量和计算量与人体姿态估计精度之间取得平衡仍然是一个艰巨的挑战。在这项研究中，我们提出了一种轻量级的跨尺度特征融合网络（LCFFNet），以在精度、计算负荷和参数量之间取得平衡。LCFFNet由轻量级类hrnet （LHRNet）网络、跨分辨率感知语义模块（CRASM）和自适应特征融合模块（AFFM）组成。更准确地说，首先，我们提出了一个轻量级的LHRNet网络，该网络在三个高分辨率子网阶段中包括动态多尺度卷积基本（DMSC-Basic块）块、基本块和dmsc -基本块子模块。提出的DMSC-Basic块动态多尺度卷积可以减少LHRNet网络的模型参数数量和复杂度，并具有提取可变姿态特征的能力。为了保持模型表达特征的能力，引入了基本块。因此，LHRNet网络不仅使模型更加轻量级，而且增强了模型的特征表达能力。其次，我们提出了一个CRASM模块，通过融合不同尺度的特征来增强上下文语义信息，同时减少不同尺度之间的语义差距。最后，利用本文提出的AFFM从下到上还原增强语义特征图的空间分辨率，并利用自适应特征融合提高重要位置的定位精度。我们的方法在MSCOCO 2017、MPII和Crowdpose数据集上分别以74.2%、89.9% PCKh@0.5和66.9% AP成功预测关键点。与HRNet相比，我们的模型减少了89.0%的参数数量和87.5%的计算复杂度。该网络的性能与目前的大型模型人体姿态估计网络一样好，同时优于最先进的轻量级网络。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

LCFFNet: A Lightweight Cross-scale Feature Fusion Network for human pose estimation

Human pose estimation is one of the most critical and challenging problems in computer vision. It is applied in many computer vision fields and has important research significance. However, it is still a difficult challenge to strike a balance between the number of parameters and computing load of the model and the accuracy of human pose estimation. In this study, we suggest a Lightweight Cross-scale Feature Fusion Network (LCFFNet) to strike a balance between accuracy and computational load and parameter volume. The Lightweight HRNet-Like (LHRNet) network, Cross-Resolution-Aware Semantics Module (CRASM), and Adapt Feature Fusion Module (AFFM) make up LCFFNet. To be more precise, first, we suggest a lightweight LHRNet network that includes Dynamic Multi-scale Convolution Basic (DMSC-Basic block) block, Basic block, and DMSC-Basic block submodules in the network’s three high-resolution subnetwork stages. The proposed dynamic multi-scale convolution in DMSC-Basic block can reduces the amount of model parameters and complexity of the LHRNet network, and has the ability to extract variable pose features. In order to maintain the model’s ability to express features, the Basic block is introduced. As a result, the LHRNet network not only makes the model more lightweight but also enhances its feature expression capabilities. Second, we propose a CRASM module to enhance contextual semantic information while reducing the semantic gap between different scales by fusing features from different scales. Finally, the augmented semantic feature map’s spatial resolution is finally restored from bottom to top using our suggested AFFM, and adaptive feature fusion is used to increase the positioning accuracy of important sites. Our method successfully predicts keypoints with 74.2 % AP, 89.9 % [email protected] and 66.9 % AP on the MSCOCO 2017, MPII and Crowdpose datasets, respectively. Our model reduces the number of parameters by 89.0 % and the computational complexity by 87.5 % compared with HRNet. The proposed network performs as well as current large-model human pose estimation networks while outperforming state-of the-art lightweight networks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Neural Networks 工程技术-计算机：人工智能

CiteScore

13.90

自引率

7.70%

发文量

425

审稿时长

67 days

期刊介绍： Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.