
Signal Processing-Image Communication: Latest Publications

U-MobileViT: A Lightweight Vision Transformer-based Backbone for Panoptic Driving Segmentation
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-12-23 DOI: 10.1016/j.image.2025.117461
Phuoc-Thinh Nguyen, The-Bang Nguyen, Phu Pham, Quang-Thinh Bui
Panoramic driving perception requires robust and efficient context understanding, which demands simultaneous semantic and instance segmentation. This paper proposes U-MobileViT, a lightweight backbone network designed to address this challenge. Our architecture combines the advantages of MobileViT, a family of Transformer-based models with high accuracy and fast processing speed, with the image segmentation structure of the U-Net model, facilitating multiscale feature fusion and accurate localization. U-MobileViT efficiently combines local and global spatial information by utilizing MobileViT Blocks with Separable-Attention layers, resulting in a computationally lightweight yet effective architecture, while the U-Net structure enables efficient integration of features from different levels of the hierarchy. This synergistic combination enables the generation of rich, context-aware feature maps that are critical for accurate panoramic segmentation. Through extensive experiments on the challenging BDD100K driving dataset, we demonstrate that U-MobileViT achieves state-of-the-art performance in panoramic driving perception, outperforming existing lightweight models in both accuracy and inference speed. Our results demonstrate the potential of U-MobileViT as a robust and efficient backbone for real-time panoramic scene understanding in autonomous driving applications. Code is available at https://github.com/quyongkeomut/UMobileViT.
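To make the described combination concrete, the following is a minimal PyTorch sketch of the general idea only: a MobileViT-style block that mixes a depthwise-separable convolution with self-attention over flattened tokens, wired into a small U-Net-style encoder-decoder with one skip connection. All module names, channel sizes, and the use of standard multi-head attention in place of the paper's Separable-Attention layers are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' U-MobileViT implementation.
import torch
import torch.nn as nn

class MobileViTStyleBlock(nn.Module):
    def __init__(self, channels, heads=2):
        super().__init__()
        # local representation: depthwise separable convolution
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.SiLU(),
        )
        # global representation: standard MHA as a stand-in for separable attention
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        x = self.local(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        n = self.norm(tokens)
        tokens = tokens + self.attn(n, n, n)[0]        # residual global mixing
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyUMobileViT(nn.Module):
    def __init__(self, in_ch=3, base=32, num_classes=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1),
                                  MobileViTStyleBlock(base))
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.enc2 = MobileViTStyleBlock(base * 2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.SiLU())
        self.head = nn.Conv2d(base, num_classes, 1)    # per-pixel segmentation logits

    def forward(self, x):
        s1 = self.enc1(x)
        s2 = self.enc2(self.down(s1))
        d = self.dec(torch.cat([self.up(s2), s1], dim=1))  # U-Net skip connection
        return self.head(d)

if __name__ == "__main__":
    logits = TinyUMobileViT()(torch.randn(1, 3, 32, 32))
    print(logits.shape)  # torch.Size([1, 3, 32, 32])
```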
Citations: 0
Deep learning model with co-ordinated relationship for image captioning enabled via attentional language encoder-decoder
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-12-23 DOI: 10.1016/j.image.2025.117466
Shaheen Raphiahmed Mujawar, Sridhar Iyer
The development of an image captioning system could make the world more accessible to persons who are blind. Recently, researchers have focused on the need to create automatic textual descriptions associated with observed images. However, in computer vision and natural language processing, autonomously creating captions for images is difficult. Hence, this article proposes an efficient automatic image captioning model with an attentional language encoder-decoder framework enabled by Deep Learning (DL) models. The developed model integrates four main strategies: the Feature Extractor Encoder Module (FEEM), the Co-ordinated Relationship Learning Module (CRLM), the Attentional Feature Fusion Module (AFFM), and the Language Decoder Module. Region- and semantic-based feature extraction from the image is ensured by utilizing the Res-Inception and Convolutional Neural Network (CNN) model. Moreover, CRLM is introduced to generate balanced relationship features, and AFFM is used to fuse various levels of visual information and selectively focus on particular visual regions associated with each word prediction. An Attentional Model with Residual BiGRU (ARBiGRU) is implemented as a language model for decoding to identify the correct caption for the input image effectively. The developed model utilizes the Flickr8k and Flickr30k datasets. To examine the performance of the proposed work, caption metrics such as BLEU, METEOR, CIDER, and ROUGE-L are used. To evaluate the effectiveness of the proposed model, an ablation study is conducted using six cases, and the performance analysis demonstrates that the proposed approach outperforms the existing techniques in caption generation.
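As a rough illustration of the attentional encoder-decoder idea, the sketch below implements a single-layer GRU decoder with additive attention over precomputed region features. The FEEM/CRLM/AFFM modules and the Residual BiGRU of the paper are not reproduced; all dimensions and the use of a plain GRUCell are assumptions.

```python
# Minimal, illustrative sketch of an attention-guided caption decoder.
import torch
import torch.nn as nn

class AttentionCaptionDecoder(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, hidden=512, vocab=5000):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        # additive attention over encoder region features
        self.att_feat = nn.Linear(feat_dim, hidden)
        self.att_hid = nn.Linear(hidden, hidden)
        self.att_out = nn.Linear(hidden, 1)
        self.rnn = nn.GRUCell(embed_dim + feat_dim, hidden)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, regions, captions):
        # regions: (B, R, feat_dim) region features; captions: (B, T) token ids
        b, t = captions.shape
        h = regions.new_zeros(b, self.rnn.hidden_size)
        logits = []
        for step in range(t):
            # attention weights over regions, conditioned on the decoder state
            scores = self.att_out(torch.tanh(self.att_feat(regions) +
                                             self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)            # (B, R, 1)
            context = (alpha * regions).sum(dim=1)          # (B, feat_dim)
            h = self.rnn(torch.cat([self.embed(captions[:, step]), context], dim=-1), h)
            logits.append(self.proj(h))
        return torch.stack(logits, dim=1)                   # (B, T, vocab)

if __name__ == "__main__":
    dec = AttentionCaptionDecoder()
    out = dec(torch.randn(2, 36, 512), torch.randint(0, 5000, (2, 12)))
    print(out.shape)  # torch.Size([2, 12, 5000])
```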
Citations: 0
Learnable token for visual tracking
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-12-23 DOI: 10.1016/j.image.2025.117465
Yan Chen, Zhongkang Jiang, Jixiang Du, Hongbo Zhang
High-quality fusion of template and search frames is essential for effective visual object tracking. However, mainstream Transformer-based trackers, whether dual-stream or single-stream, often fuse these frames indiscriminately, allowing background noise to disrupt target-specific feature extraction. To address this, we propose LTTrack (learnable token for visual tracking), an adaptive feature fusion method based on a Transformer architecture with an autoregressive encoder–decoder structure. The core innovation is a learnable token in the encoder, which processes three inputs: search tokens, template tokens, and the learnable token. This token is designed to interact with the template, enabling precise fusion and extraction of target-relevant features. Our approach adaptively fuses search and template tokens, and extensive experiments show LTTrack achieves state-of-the-art performance across six challenging benchmarks.
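A minimal sketch of the learnable-token idea follows, under the assumption that the token is simply concatenated with the template and search tokens before a standard Transformer encoder and read back out afterwards; the paper's autoregressive decoder and prediction head are not reproduced, and all dimensions are illustrative.

```python
# Illustrative sketch: a learnable token fused with template and search tokens.
import torch
import torch.nn as nn

class LearnableTokenEncoder(nn.Module):
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        self.fuse_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable token
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, Nt, dim), search_tokens: (B, Ns, dim)
        b = search_tokens.size(0)
        fuse = self.fuse_token.expand(b, -1, -1)
        x = torch.cat([fuse, template_tokens, search_tokens], dim=1)
        x = self.encoder(x)                                 # joint attention over all three
        fused = x[:, :1]                                    # fused target descriptor
        search_out = x[:, 1 + template_tokens.size(1):]     # refined search tokens
        return fused, search_out

if __name__ == "__main__":
    enc = LearnableTokenEncoder()
    fused, search = enc(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
    print(fused.shape, search.shape)  # torch.Size([2, 1, 256]) torch.Size([2, 256, 256])
```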
Citations: 0
DPMMN: A dual performer-multi-modal network for emotion recognition
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-12-22 DOI: 10.1016/j.image.2025.117464
Shivanand S. Gornale, Shivakumara Palaiahnakote, Amruta Unki, Sunil Vadera

Background and Objective

Although emotion recognition systems have been widely advocated, their accuracy can be affected when a person’s normal facial features overlap with their expressions while in a particular emotional state. This study, therefore, explores how heatmaps of electroencephalography (EEG) signals can be integrated with facial information to improve the accuracy of emotion recognition systems.

Method

The key idea of the proposed work is to fuse EEG signal heatmaps and facial information for recognizing eight different emotions. For implementing this new idea, we propose a Dual Performer Multi-Modal Network (DPMMN). For each modality, the proposed work integrates a modified Vision Transformer (ViT) and Long Short-Term Memory (LSTM). The integration is achieved by concatenating the features extracted from each modality and using them to classify the different emotions. In contrast to a baseline ViT, which uses self-attention layers, the proposed work replaces the self-attention layers with Performer layers through a kernelized attention approach. This results in extracting distinct visual features from EEG signal heatmaps and facial images. Similarly, for capturing temporal features from EEG heatmaps and facial videos, the proposed LSTM replaces a traditional feed-forward network with a recurrent structure. This step helps to learn sequential dependencies across the patches.
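For readers unfamiliar with kernelized attention, the sketch below shows a simplified Performer-style (FAVOR+) attention that approximates softmax attention in linear time using positive random features. The feature count, the scaling, and the omission of orthogonal random features are simplifications; this is not the DPMMN code.

```python
# Minimal sketch of Performer-style kernelized attention with positive random features.
import torch

def positive_random_features(x, proj):
    # x: (B, N, d) queries or keys; proj: (m, d) Gaussian projection matrix
    d = x.size(-1)
    x = x * d ** -0.25                                   # split the usual 1/sqrt(d) scaling
    wx = x @ proj.t()                                    # (B, N, m)
    sq = (x ** 2).sum(dim=-1, keepdim=True) / 2
    return torch.exp(wx - sq) / proj.size(0) ** 0.5      # strictly positive features

def performer_attention(q, k, v, num_features=64):
    # In practice the projection is fixed or redrawn periodically; here it is per call.
    proj = torch.randn(num_features, q.size(-1), device=q.device)
    q_f = positive_random_features(q, proj)              # (B, N, m)
    k_f = positive_random_features(k, proj)              # (B, N, m)
    kv = torch.einsum("bnm,bnd->bmd", k_f, v)            # (B, m, d_v), linear in N
    normalizer = q_f @ k_f.sum(dim=1).unsqueeze(-1)      # (B, N, 1)
    return (q_f @ kv) / (normalizer + 1e-6)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 196, 64)                  # e.g. 14x14 patch tokens
    print(performer_attention(q, k, v).shape)            # torch.Size([2, 196, 64])
```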

Results

A comprehensive evaluation of DPMMN with respect to current state-of-the-art systems shows favorable results, with DPMMN achieving 97.02% in identifying eight distinct emotions on the DEAP benchmark dataset.

Conclusion

The proposed work shows that using EEG signal heatmaps together with facial information is better than using EEG signals or facial information alone. Similarly, integrating Performer layers with the ViT and LSTM outperforms existing models in extracting distinct features to classify the eight emotions.
Citations: 0
Analysis of image aesthetics assessment as a positive-unlabelled problem
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-12-01 DOI: 10.1016/j.image.2025.117441
Luis Gonzalez-Naharro, M. Julia Flores, Jesus Martínez-Gómez, Jose M. Puerta
Image aesthetics assessment (IAA) has been traditionally addressed as a supervised learning problem, where the goal is to accurately predict information related to user opinions, such as the mean opinion score, image ratings, or a binary quality label, usually crafted by using a mean score threshold to label images as highly or lowly aesthetic.
Supervised approaches fail to take into account the subjectiveness of this problem, as the idea of aesthetic pleasantness varies among different people and different cultures, thus making the labels extremely noisy. However, the existence of worldwide photographic contests, exhibitions and masters implies that, to a reasonable degree, there is a broader consensus about the quality of very high-quality images and photographs. Furthermore, labelling image data for IAA is a difficult process, as a large amount of non-trivial aesthetic judgements are required for obtaining a large-scale IAA dataset.
Therefore, in this work we analyse the potential of positive-unlabelled techniques for solving IAA. We propose techniques for building PU datasets from traditional IAA datasets and from available reference datasets of high-quality images, and test several well-known PU algorithms on these. Our results highlight the potential of PU approaches for IAA, as we obtain results close to the state-of-the-art with much smaller sets of labelled data: in experiments with only 5% of the data labelled in AVA, we reach accuracy levels only 0.03 points below NIMA, and we reach competitive balanced accuracy levels in settings with a very limited amount of labelled data and with very simple models.
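As a concrete, hedged illustration of the PU setting (not the paper's exact protocol), the sketch below builds a PU split from a scored dataset by revealing only a small fraction of high-scoring items as positives, then applies the classic Elkan-Noto correction with a logistic-regression non-traditional classifier. The synthetic features, the 5% labelling rate, and the 70th-percentile threshold are assumptions.

```python
# Illustrative PU-learning sketch for aesthetics assessment.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 32))                       # stand-in image features
scores = features[:, 0] + rng.normal(scale=0.5, size=2000)   # stand-in mean opinion scores
is_aesthetic = scores > np.quantile(scores, 0.7)             # hidden ground-truth label

# PU construction: reveal only a small fraction of true positives as labelled.
s = (is_aesthetic & (rng.random(2000) < 0.05)).astype(int)   # s = 1 means "labelled positive"

x_tr, x_va, s_tr, s_va, _, y_va = train_test_split(
    features, s, is_aesthetic, test_size=0.3, random_state=0)

# Non-traditional classifier: predict the labelling indicator s from the features.
clf = LogisticRegression(max_iter=1000).fit(x_tr, s_tr)

# Elkan-Noto correction: c = P(s=1 | y=1), estimated on held-out labelled positives,
# then P(y=1 | x) is approximated by P(s=1 | x) / c.
c = clf.predict_proba(x_va[s_va == 1])[:, 1].mean()
posterior = clf.predict_proba(x_va)[:, 1] / max(c, 1e-6)
print("balanced accuracy:", balanced_accuracy_score(y_va, posterior > 0.5))
```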
Citations: 0
Tri-modal fusion for dynamic hand gesture recognition: Integrating RGB, depth, and skeleton data
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-11-25 DOI: 10.1016/j.image.2025.117440
Reena Tripathi, Bindu Verma
Computer vision applications continue to show a strong interest in dynamic hand gesture recognition because of its wide range of applications in automation, human–computer interaction, and other fields. There are several challenges in dynamic hand gesture recognition, such as occlusion and background clutter, which make gesture tracking and classification difficult. To address this, our proposed work fuses multiple modalities, each of which has its own advantages. The first modality utilizes RGB data, which gives spatial information that helps interpret the gesturing hand’s shape, texture, and color. The second modality employs depth data, which records activity motion. The third modality incorporates skeleton data, which resolves the challenges of complex backgrounds and occlusion. The features are extracted in parallel from all modalities using a pre-trained ResCLIP model. In sequence-to-sequence learning, the LSTM unit processes the generated feature vectors. At the feature level, the features from all three LSTM networks are concatenated before the fully connected (FC) layer, and a SoftMax function is employed to classify the gestures. The proposed model was applied to two benchmark datasets, the First-Person Hand Action dataset (FPHA) and the Sheffield Kinect Gesture dataset (SKIG), demonstrating its effectiveness. The proposed work outperformed the state-of-the-art techniques on the FPHA and SKIG datasets, exhibiting competitive performance.
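The late-fusion pipeline described above can be sketched as three per-modality LSTMs whose final hidden states are concatenated and classified. The feature dimensions below are assumptions, and the pre-trained ResCLIP extractor is treated as an external step that produces the per-frame feature sequences.

```python
# Illustrative sketch of tri-modal feature-level fusion with per-modality LSTMs.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, feat_dims=(512, 512, 128), hidden=256, num_classes=10):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True) for d in feat_dims)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * len(feat_dims), 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, rgb_seq, depth_seq, skel_seq):
        # each input: (B, T, feat_dim) -- per-frame features of one modality
        outs = []
        for lstm, seq in zip(self.lstms, (rgb_seq, depth_seq, skel_seq)):
            _, (h, _) = lstm(seq)
            outs.append(h[-1])                  # last-layer hidden state, (B, hidden)
        fused = torch.cat(outs, dim=-1)         # feature-level concatenation
        return self.classifier(fused)           # logits; softmax applied in the loss

if __name__ == "__main__":
    model = TriModalFusion()
    logits = model(torch.randn(4, 16, 512), torch.randn(4, 16, 512),
                   torch.randn(4, 16, 128))
    print(logits.shape)  # torch.Size([4, 10])
```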
Citations: 0
Optimization model for sign language recognition using hybrid convolution networks
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-11-25 DOI: 10.1016/j.image.2025.117444
S. Venkatesh, Pravin R. Kshirsagar, R. Thiagarajan, Tan Kuan Tak, B. Sivaneasan
The sign language gesture recognition model seeks to provide effective communication by converting sign language motions into spoken or written language, allowing signers to connect with non-signers. The deep features are extracted using Vision Transformer-YOLOv5 (ViT-YOLOv5), which extracts the Regions of Interest (ROI) from the images to generate the first set of features, F1. Concurrently, Scale-Invariant Feature Transform (SIFT) is used to extract a second set of features, F2, from the same images. The two extracted feature sets are fed into a Hybrid Convolution-based Adaptive EfficientB7 Network (HCA-EfB7N). In this network, F1 is processed using 1D convolution, while F2 is processed using 2D convolution to obtain the recognized outcome. By utilizing both 1D and 2D convolutions, the proposed model accurately identifies the class of hand gestures, leading to more accurate recognition. The parameters of the HCA-EfB7N are optimized using the Fitness-based Archimedes Optimization Algorithm (FAOA). This hybrid approach recognizes complex hand gestures, particularly in the sign language translation system. The proposed approach’s effectiveness is validated by comparing its performance against several baseline systems, thereby confirming its superiority and robustness in recognizing sign language gestures.
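A minimal sketch of the hybrid 1D/2D convolution fusion follows, assuming F1 arrives as a flat deep-feature vector and F2 as a 2D descriptor map; the layer sizes are illustrative, and neither the tuned HCA-EfB7N nor the FAOA optimizer is reproduced.

```python
# Illustrative sketch of fusing a 1D-conv branch (F1) and a 2D-conv branch (F2).
import torch
import torch.nn as nn

class HybridConvClassifier(nn.Module):
    def __init__(self, f2_channels=1, num_classes=26):
        super().__init__()
        self.branch1d = nn.Sequential(                  # F1: flat deep features
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64), nn.Flatten())
        self.branch2d = nn.Sequential(                  # F2: 2D descriptor map
            nn.Conv2d(f2_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten())
        self.head = nn.Linear(16 * 64 + 16 * 8 * 8, num_classes)

    def forward(self, f1, f2):
        # f1: (B, L) flat deep features; f2: (B, C, H, W) descriptor map
        x1 = self.branch1d(f1.unsqueeze(1))             # add a channel dim for Conv1d
        x2 = self.branch2d(f2)
        return self.head(torch.cat([x1, x2], dim=-1))   # fused classification logits

if __name__ == "__main__":
    model = HybridConvClassifier()
    print(model(torch.randn(2, 1280), torch.randn(2, 1, 32, 32)).shape)  # (2, 26)
```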
Citations: 0
A survey on video emotion recognition: Segmentation, classification, and explainable AI techniques
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-11-25 DOI: 10.1016/j.image.2025.117442
Sudhakar Hallur, Anil Gavade, Priyanka Gavade
Emotion recognition from videos has become a pivotal domain in computer vision and affective computing, contributing to advancements in human–computer interaction, healthcare, security, and multimedia analysis. This survey systematically reviews 137 research papers that span segmentation, classification, and explainable artificial intelligence (XAI) techniques for video-based emotion recognition. The study categorizes works into probabilistic, clustering, deep learning, affective computing, fuzzy logic, genetic algorithms, hybrid, multimodal, and XAI-based approaches. Through a structured evaluation of datasets such as FER2013, CK+, RAVDESS, AffectNet, and EMOTIC, this review highlights how convolutional, recurrent, and transformer architectures, combined with multimodal fusion and attention mechanisms, have maximized emotion detection accuracy in certain contexts. It also identifies key challenges including dataset bias, multimodal synchronization, interpretability, and computational complexity. The paper emphasizes the rising importance of XAI in bridging the gap between model transparency and human cognition, proposing that future research focus on explainable, context-aware, and ethically grounded frameworks for robust emotion understanding. By consolidating diverse research trajectories, this survey offers a unified perspective on current advancements, limitations, and future directions in video emotion analysis.
Citations: 0
MSTSGM: A multi-scale temporal–spatial guided model for image deblurring
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-11-25 DOI: 10.1016/j.image.2025.117443
Boyu Pei, Kejun Long, Zhibo Gao, Jian Gu, Shaofei Wang, Xinhu Lu
Image deblurring is a critical task in computer vision, essential for recovering sharp images from blurry ones often caused by motion blur or camera shake. Recent advancements in deep learning have introduced convolutional neural networks (CNNs) as a powerful alternative, enabling the learning of intricate mappings between blurry and sharp images. However, existing deep learning approaches still struggle with effectively capturing low-frequency information and maintaining robustness across diverse blur conditions, while high-frequency details are often inadequately restored due to their susceptibility to motion blur. This paper presents the Multi-Scale Temporal–Spatial Guided Model (MSTSGM), which integrates multi-scale feature decoupling (MSFD), temporal convolution networks (TCN), and edge attention guided reconstruction (EAGR) to enhance deblurring performance. The MSFD captures a wide range of details by decomposing images into multi-scale representations, while the TCN refines these features by modeling temporal dependencies in blur formation. The EAGR focuses on key edge features, effectively improving image clarity. Evaluated on benchmark datasets including GoPro, HIDE, and RealBlur, MSTSGM demonstrates competitive performance, achieving higher PSNR and SSIM metrics compared to state-of-the-art methods. Ablation studies validate the contribution of each component, highlighting the synergistic effects of multi-scale processing, temporal feature integration, and edge attention. Furthermore, MSTSGM’s application as a preprocessing step for object detection tasks illustrates its practical utility in enhancing the accuracy of downstream computer vision applications. MSTSGM provides a robust solution for advancing image deblurring and related tasks in the field. Source code is available for research purposes at https://github.com/priplex/MSTSGM.
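Since the MSFD/TCN/EAGR modules are only described at a high level here, the following sketch illustrates just two of the ingredients, multi-scale decomposition with fusion and a simple learned edge-attention gate, with every layer choice being an assumption rather than the authors' design (see the released code at the URL above for the actual implementation).

```python
# Illustrative sketch of multi-scale feature fusion with an edge-attention gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEdgeFusion(nn.Module):
    def __init__(self, channels=32, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in scales)
        self.fuse = nn.Conv2d(channels * len(scales), channels, 1)
        self.edge_gate = nn.Conv2d(channels, 1, 3, padding=1)   # predicts an edge mask

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for conv, s in zip(self.convs, self.scales):
            y = F.avg_pool2d(x, s) if s > 1 else x              # decompose to a coarser scale
            y = conv(y)
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                       align_corners=False))    # back to full resolution
        fused = self.fuse(torch.cat(feats, dim=1))
        gate = torch.sigmoid(self.edge_gate(fused))             # edge-attention weights
        return x + gate * fused                                 # residual, edge-weighted output

if __name__ == "__main__":
    out = MultiScaleEdgeFusion()(torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 64, 64])
```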
Citations: 0
Keypoint detection in Tai Chi Chuan Essence via Waist and Limbs Feature Separation
IF 2.7 Tier 3 (Engineering & Technology) Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-11-25 DOI: 10.1016/j.image.2025.117439
Yi Yang, Hao Fu, Yunlong Lv, Wei Qian, Tian Wang
The Tai Chi Chuan (TCC) teaching process requires face-to-face interaction between learners and masters, which restrains the promotion of TCC since skilled masters are rare and manual evaluation is time-consuming. Skeleton-based action recognition models are effective for this issue, but the existing human skeletal structures and detection frameworks are inadequate for this task. To address these challenges, a novel human skeletal structure aligned with the essence of TCC is proposed, along with a dataset of 2400 annotated images covering fundamental TCC movements. Furthermore, a module named Waist and Limbs Feature Separation (WLFS) is proposed for structured modeling. Based on the spatiotemporal characteristics of TCC movements, the WLFS module explicitly separates keypoints into dynamic and static categories channel-wise. Subsequently, two exclusive GAUs are applied to the static and dynamic regions respectively. This strategy enables the network to learn the distinct features of the separated category regions and accelerates the convergence of the network weights during the training process. To preserve the fine-scale features of TCC keypoints during downsampling, a Multi-Scale Feature Fusion (MSFF) module is integrated into WLFS, which fuses feature maps of different spatial resolutions to enhance the feature representation of the model at small scales. Experiments on the custom dataset (Tai Chi) and public datasets (MPII, COCO-WholeBody V1.0, Animal Pose, and AP-10K) demonstrate that the proposed method achieves competitive performance and good generalization ability.
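The channel-wise separation idea can be sketched as follows: keypoint tokens are split into a static (torso/waist) group and a dynamic (limbs) group, each refined by its own attention branch, and then recombined. The keypoint index split and the use of plain multi-head attention in place of the paper's GAUs are assumptions.

```python
# Illustrative sketch of separating keypoints into static and dynamic groups.
import torch
import torch.nn as nn

class WaistLimbsSeparation(nn.Module):
    def __init__(self, dim=128, static_idx=(0, 1, 2, 3),
                 dynamic_idx=tuple(range(4, 17))):
        super().__init__()
        self.static_idx = list(static_idx)       # assumed torso/waist keypoints
        self.dynamic_idx = list(dynamic_idx)     # assumed limb keypoints
        self.static_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.dynamic_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, K, dim) -- one feature vector per keypoint
        out = tokens.clone()
        for idx, attn in ((self.static_idx, self.static_attn),
                          (self.dynamic_idx, self.dynamic_attn)):
            group = tokens[:, idx]               # exclusive group of keypoints
            refined, _ = attn(group, group, group)
            out[:, idx] = group + refined        # residual refinement per group
        return out

if __name__ == "__main__":
    module = WaistLimbsSeparation()
    print(module(torch.randn(2, 17, 128)).shape)  # torch.Size([2, 17, 128])
```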
Citations: 0