Low-parameter GAN inversion framework based on hypernetwork
Pub Date: 2024-06-17 | DOI: 10.1007/s00530-024-01379-9
Hongyang Wang, Ting Wang, Dong Xiang, Wenjie Yang, Jia Li
In response to the significant parameter overhead of current Generative Adversarial Network (GAN) inversion methods when balancing high fidelity and editability, we propose a novel lightweight inversion framework based on an optimized generator, aiming to balance fidelity and editability within the StyleGAN latent space. To achieve this, the study first maps raw data to the \(W^{+}\) latent space, enhancing the quality of the resulting inverted images. Following this mapping step, we introduce a carefully designed lightweight hypernetwork that selectively modifies primary detailed features, notably reducing the parameter count needed for model training. By learning parameter variations, the precision of subsequent image editing is improved. Lastly, our approach integrates a multi-channel parallel optimization computing module into this structure to decrease image-processing time. Extensive experiments in the facial and automotive imagery domains validate our lightweight inversion framework: the method achieves equivalent or superior inversion and editing quality while using fewer parameters.
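The abstract does not specify the hypernetwork's layout. As a rough sketch of the general idea (a small network that predicts low-rank weight offsets for a few selected generator layers, keeping the trainable parameter count down), with all dimensions hypothetical:

```python
import torch
import torch.nn as nn

class LowRankHypernetwork(nn.Module):
    """Predict low-rank weight offsets for n_layers generator layers.

    Hypothetical dimensions throughout; each layer's offset is factored
    through rank-r matrices so that every head stays lightweight.
    """
    def __init__(self, feat_dim=512, n_layers=4, rank=8):
        super().__init__()
        self.rank, self.feat_dim = rank, feat_dim
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, 2 * rank * feat_dim) for _ in range(n_layers)
        )

    def forward(self, image_feat):               # (B, feat_dim) inversion features
        deltas = []
        for head in self.heads:
            ab = head(image_feat).view(-1, 2, self.rank, self.feat_dim)
            a, b = ab[:, 0], ab[:, 1]            # (B, rank, feat_dim) each
            # (B, feat_dim, feat_dim) offset to add to that layer's weight
            deltas.append(torch.einsum('bri,brj->bij', a, b))
        return deltas
```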
{"title":"Low-parameter GAN inversion framework based on hypernetwork","authors":"Hongyang Wang, Ting Wang, Dong Xiang, Wenjie Yang, Jia Li","doi":"10.1007/s00530-024-01379-9","DOIUrl":"https://doi.org/10.1007/s00530-024-01379-9","url":null,"abstract":"<p>In response to the significant parameter overhead in current Generative Adversarial Networks (GAN) inversion methods when balancing high fidelity and editability, we propose a novel lightweight inversion framework based on an optimized generator. We aim to balance fidelity and editability within the StyleGAN latent space. To achieve this, the study begins by mapping raw data to the <span>({W}^{+})</span> latent space, enhancing the quality of the resulting inverted images. Following this mapping step, we introduce a carefully designed lightweight hypernetwork. This hypernetwork operates to selectively modify primary detailed features, thereby leading to a notable reduction in the parameter count essential for model training. By learning parameter variations, the precision of subsequent image editing is augmented. Lastly, our approach integrates a multi-channel parallel optimization computing module into the above structure to decrease the time needed for model image processing. Extensive experiments were conducted in facial and automotive imagery domains to validate our lightweight inversion framework. Results demonstrate that our method achieves equivalent or superior inversion and editing quality, utilizing fewer parameters.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"5 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SenseMLP: a parallel MLP architecture for sensor-based human activity recognition
Pub Date: 2024-06-17 | DOI: 10.1007/s00530-024-01384-y
Weilin Li, Jiaming Guo, Hong Wu
Human activity recognition (HAR) with wearable inertial sensors is a burgeoning field, propelled by advances in sensor technology. Deep learning methods for HAR have notably enhanced recognition accuracy in recent years. Nonetheless, the complexity of previous models often impedes their use in real-life scenarios, particularly in online applications. Addressing this gap, we introduce SenseMLP, a novel approach employing a multi-layer perceptron (MLP) neural network architecture. SenseMLP features three parallel MLP branches that independently process and then integrate features across the time, channel, and frequency dimensions. This structure not only simplifies the model but also significantly reduces the number of required parameters compared to previous deep learning HAR frameworks. We conducted comprehensive evaluations of SenseMLP on benchmark HAR datasets, including PAMAP2, OPPORTUNITY, USC-HAD, and SKODA. Our findings demonstrate that SenseMLP not only achieves state-of-the-art accuracy but also requires fewer parameters and fewer floating-point operations. For further research and application in the field, the source code of SenseMLP is available at https://github.com/forfrees/SenseMLP.
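As a minimal sketch of the three-branch idea (not the paper's exact design), each branch can be an MLP applied along one axis of a time x channel x frequency tensor, with a residual sum as an assumed fusion rule:

```python
import torch
import torch.nn as nn

class ThreeBranchBlock(nn.Module):
    """One MLP mixes along time, one along sensor channels, one along
    frequency bins; the residual-sum fusion and all sizes are assumptions."""
    def __init__(self, t_dim, c_dim, f_dim, hidden=64):
        super().__init__()
        def mlp(d):
            return nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))
        self.time_mlp, self.chan_mlp, self.freq_mlp = mlp(t_dim), mlp(c_dim), mlp(f_dim)

    def forward(self, x):                        # x: (batch, time, channel, freq)
        t = self.time_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        c = self.chan_mlp(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        f = self.freq_mlp(x)
        return x + t + c + f                     # fuse the three branches
```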
{"title":"SenseMLP: a parallel MLP architecture for sensor-based human activity recognition","authors":"Weilin Li, Jiaming Guo, Hong Wu","doi":"10.1007/s00530-024-01384-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01384-y","url":null,"abstract":"<p>Human activity recognition (HAR) with wearable inertial sensors is a burgeoning field, propelled by advances in sensor technology. Deep learning methods for HAR have notably enhanced recognition accuracy in recent years. Nonetheless, the complexity of previous models often impedes their use in real-life scenarios, particularly in online applications. Addressing this gap, we introduce SenseMLP, a novel approach employing a multi-layer perceptron (MLP) neural network architecture. SenseMLP features three parallel MLP branches that independently process and integrate features across the time, channel, and frequency dimensions. This structure not only simplifies the model but also significantly reduces the number of required parameters compared to previous deep learning HAR frameworks. We conducted comprehensive evaluations of SenseMLP against benchmark HAR datasets, including PAMAP2, OPPORTUNITY, USC-HAD, and SKODA. Our findings demonstrate that SenseMLP not only achieves state-of-the-art performance in terms of accuracy but also boasts fewer parameters and lower floating-point operations per second. For further research and application in the field, the source code of SenseMLP is available at https://github.com/forfrees/SenseMLP.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"36 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141515062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LMFE-RDD: a road damage detector with a lightweight multi-feature extraction network
Pub Date: 2024-06-14 | DOI: 10.1007/s00530-024-01367-z
Qihan He, Zhongxu Li, Wenyuan Yang
Road damage detection, which uses computer vision and deep learning to automatically identify many kinds of road damage, is an efficient application of object detection that can significantly improve the efficiency of road maintenance planning and repair work and help ensure road safety. However, because target recognition is complex, existing road damage detection models usually carry a large number of parameters and require heavy computation, resulting in slow inference that limits deployment on devices with constrained computing resources. In this study, we propose LMFE-RDD, a road damage detector that balances speed and accuracy by combining a Lightweight Multi-Feature Extraction Network (LMFE-Net) as the backbone with an Efficient Semantic Fusion Network (ESF-Net) for multi-scale feature fusion. First, LMFE-Net takes road damage images as input and produces feature maps at three different scales. Second, ESF-Net fuses these three feature maps and outputs three fused features. Finally, the detection head performs target identification and localization to obtain the final result. In addition, we use WDB loss, a multi-task loss function with a non-monotonic dynamic focusing mechanism, to pay more attention to bounding-box regression losses. Experimental results show that the proposed LMFE-RDD model offers competitive accuracy while maintaining speed. On the Multi-Perspective Road Damage Dataset, combining the data from all perspectives, LMFE-RDD achieves 51.0 FPS and 64.2% mAP@0.5 with only 13.5 M parameters.
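The three-stage pipeline described above (backbone yielding three scales, fusion of the three maps, per-scale detection heads) can be sketched structurally; plain convolutions stand in here for the unspecified LMFE-Net and ESF-Net blocks, and the channel sizes and head outputs are illustrative:

```python
import torch
import torch.nn as nn

def conv(cin, cout, k=3, s=1):
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, k // 2),
                         nn.BatchNorm2d(cout), nn.SiLU())

class ThreeScaleDetector(nn.Module):
    """Structural sketch only: backbone -> three scales -> top-down fusion
    -> one head per scale."""
    def __init__(self, num_outputs=9):           # e.g. 4 box + 1 obj + 4 classes (assumed)
        super().__init__()
        self.stem = conv(3, 32, s=2)
        self.c3, self.c4, self.c5 = conv(32, 64, s=2), conv(64, 128, s=2), conv(128, 256, s=2)
        self.up = nn.Upsample(scale_factor=2)
        self.lat4 = conv(128 + 256, 128, k=1)
        self.lat3 = conv(64 + 128, 64, k=1)
        self.heads = nn.ModuleList(nn.Conv2d(c, num_outputs, 1) for c in (64, 128, 256))

    def forward(self, x):                        # x: (B, 3, H, W), H and W divisible by 16
        f3 = self.c3(self.stem(x))               # finest of the three scales
        f4 = self.c4(f3)
        f5 = self.c5(f4)                         # coarsest scale
        f4 = self.lat4(torch.cat([f4, self.up(f5)], 1))   # top-down fusion
        f3 = self.lat3(torch.cat([f3, self.up(f4)], 1))
        return [h(f) for h, f in zip(self.heads, (f3, f4, f5))]
```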
{"title":"LMFE-RDD: a road damage detector with a lightweight multi-feature extraction network","authors":"Qihan He, Zhongxu Li, Wenyuan Yang","doi":"10.1007/s00530-024-01367-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01367-z","url":null,"abstract":"<p>Road damage detection using computer vision and deep learning to automatically identify all kinds of road damage is an efficient application in object detection, which can significantly improve the efficiency of road maintenance planning and repair work and ensure road safety. However, due to the complexity of target recognition, the existing road damage detection models usually carry a large number of parameters and a large amount of computation, resulting in a slow inference speed, which limits the actual deployment of the model on the equipment with limited computing resources to a certain extent. In this study, we propose a road damage detector named LMFE-RDD for balancing speed and accuracy, which constructs a Lightweight Multi-Feature Extraction Network (LMFE-Net) as the backbone network and an Efficient Semantic Fusion Network (ESF-Net) for multi-scale feature fusion. First, as the backbone feature extraction network, LMFE-Net inputs road damage images to obtain three different scale feature maps. Second, ESF-Net fuses these three feature graphs and outputs three fusion features. Finally, the detection head is sent for target identification and positioning, and the final result is obtained. In addition, we use WDB loss, a multi-task loss function with a non-monotonic dynamic focusing mechanism, to pay more attention to bounding box regression losses. The experimental results show that the proposed LMFE-RDD model has competitive accuracy while ensuring speed. In the Multi-Perspective Road Damage Dataset, combining the data from all perspectives, LMFE-RDD achieves the detection speed of 51.0 FPS and 64.2% mAP@0.5, but the parameters are only 13.5 M.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"36 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141515063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models
Pub Date: 2024-06-05 | DOI: 10.1007/s00530-024-01373-1
Yunpeng Jia, Xiufen Ye, Xinkui Mei, Yusong Liu, Shuxiang Guo
Vision-language models (VLMs), such as Contrastive Language-Image Pretraining (CLIP), have demonstrated powerful capabilities in image classification under zero-shot settings. However, current zero-shot learning (ZSL) relies on manually tagged samples of known classes obtained through supervised learning, which wastes labeling effort and restricts real-world applications to foreseeable classes. To address these challenges, we propose the mixup long-tail unsupervised (MLTU) approach for open-world ZSL problems. The approach employs a novel long-tail mixup loss that integrates class-based re-weighting assignments with a given mixup factor for each mixed visual embedding. To mitigate the adverse impact of label noise over time, we adopt a noisy-learning strategy to filter out samples whose generated labels are incorrect. We reproduce the unsupervised experiments of existing state-of-the-art long-tail and noisy-learning approaches. Experimental results demonstrate that MLTU achieves significant classification improvements over these proven approaches on public datasets. Moreover, it serves as a plug-and-play solution for amending previous assignments and enhancing unsupervised performance, enabling automatic classification and correction of incorrect predictions caused by the projection bias of CLIP.
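As a sketch of what a class-re-weighted mixup objective can look like (inverse-frequency weights are an assumption; the paper's exact re-weighting rule is not given in the abstract):

```python
import torch
import torch.nn.functional as F

def longtail_mixup_loss(logits, y_a, y_b, lam, class_counts):
    """Mixup cross-entropy with inverse-frequency class weights so that
    tail classes contribute more to the mixed objective (assumed rule)."""
    w = 1.0 / class_counts.float()
    w = w / w.sum() * len(class_counts)          # normalize weights to mean 1
    return lam * F.cross_entropy(logits, y_a, weight=w) \
        + (1 - lam) * F.cross_entropy(logits, y_b, weight=w)

# usage sketch: mix visual embeddings, then classify the mixture
# lam = torch.distributions.Beta(0.8, 0.8).sample()
# x_mix = lam * x + (1 - lam) * x[perm]
# loss = longtail_mixup_loss(classifier(x_mix), y, y[perm], lam, counts)
```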
{"title":"MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models","authors":"Yunpeng Jia, Xiufen Ye, Xinkui Mei, Yusong Liu, Shuxiang Guo","doi":"10.1007/s00530-024-01373-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01373-1","url":null,"abstract":"<p>Vision-language models (VLM), such as Contrastive Language-Image Pretraining (CLIP), have demonstrated powerful capabilities in image classification under zero-shot settings. However, current zero-shot learning (ZSL) relies on manually tagged samples of known classes through supervised learning, resulting in a waste of labor costs and limitations on foreseeable classes in real-world applications. To address these challenges, we propose the mixup long-tail unsupervised (MLTU) approach for open-world ZSL problems. The proposed approach employs a novel long-tail mixup loss that integrated class-based re-weighting assignments with a given mixup factor for each mixed visual embedding. To mitigate the adverse impact over time, we adopt a noisy learning strategy to filter out samples that generated incorrect labels. We reproduce the unsupervised experiments of existing state-of-the-art long-tail and noisy learning approaches. Experimental results demonstrate that MLTU achieves significant improvements in classification compared to these proven existing approaches on public datasets. Moreover, it serves as a plug-and-play solution for amending previous assignments and enhancing unsupervised performance. MLTU enables the automatic classification and correction of incorrect predictions caused by the projection bias of CLIP.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"12 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A visually meaningful secure image encryption algorithm based on conservative hyperchaotic system and optimized compressed sensing
Pub Date: 2024-06-04 | DOI: 10.1007/s00530-024-01370-4
Xiaojun Tong, Xilin Liu, Tao Pan, Miao Zhang, Zhu Wang
Traditional schemes for encrypting and transmitting images can be arbitrarily corrupted by attackers, making it difficult for algorithms with poor robustness to recover the original image. To address this, this paper proposes a new visually secure image encryption algorithm that embeds the compressed and encrypted image into a carrier image, achieving visual security and thus avoiding destruction and attacks. First, a new conservative hyperchaotic system without attractors is constructed that can resist reconstruction attacks. Second, a two-dimensional (2D) compressed sensing technique is adopted: the pseudo-random sequences of the proposed chaotic system generate the measurement matrix for compressed sensing, and this matrix is optimized to improve the visual quality of image reconstruction. Finally, by combining discrete wavelet transform (DWT) and singular value decomposition (SVD), the encrypted image is embedded into the carrier image, achieving image compression, encryption, and hiding. Experimental results and comparative analysis demonstrate that the algorithm offers high security, good image reconstruction quality, and strong imperceptibility after embedding. Under limited bandwidth conditions, the algorithm achieves excellent visual security.
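As a sketch of the hiding step alone (the chaotic system and compressed-sensing stages are omitted; PyWavelets is assumed available, and the subband choice and strength alpha are illustrative):

```python
import numpy as np
import pywt  # PyWavelets, assumed available

def embed_in_carrier(carrier, secret_vec, alpha=0.05):
    """Add the (already compressed and encrypted) secret data to the
    singular values of the carrier's low-frequency DWT subband."""
    LL, (LH, HL, HH) = pywt.dwt2(carrier.astype(float), 'haar')
    U, S, Vt = np.linalg.svd(LL, full_matrices=False)
    S_marked = S + alpha * secret_vec[: len(S)]   # embed into singular values
    LL_marked = (U * S_marked) @ Vt               # U @ diag(S_marked) @ Vt
    return pywt.idwt2((LL_marked, (LH, HL, HH)), 'haar')
```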
{"title":"A visually meaningful secure image encryption algorithm based on conservative hyperchaotic system and optimized compressed sensing","authors":"Xiaojun Tong, Xilin Liu, Tao Pan, Miao Zhang, Zhu Wang","doi":"10.1007/s00530-024-01370-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01370-4","url":null,"abstract":"<p>Aiming at the traditional schemes for encrypting and transmitting images can be subject to arbitrary destruction by attackers, making it difficult for algorithms with poor robustness to recover the original image, this paper proposes a new visually image encryption algorithm, which can embed the compressed and encrypted image into a carrier image to achieve visual security, thus avoiding destruction and attacks. Foremost, a new conservative hyperchaotic system without attractors was constructed that can resist reconstruction attacks. Secondly, a two-dimensional (2D) compressed sensing technique is adopted, and the pseudo random sequences of the proposed chaotic system generates a measurement matrix in compressed sensing, and optimizes this matrix to improve the visual quality of image reconstruction. Finally, by combining discrete wavelet transform (DWT) and singular value decomposition (SVD) methods, the encrypted image is embedded into the carrier image to achieve the purpose of image compression, encryption, and hiding. And experimental results and comparative analysis demonstrate that this algorithm has high security, good image reconstruction quality, and strong imperceptibility after image embedding. Under limited bandwidth conditions, the algorithm achieves excellent visual security effects.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"18 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social bot detection on Twitter: robustness evaluation and improvement
Pub Date: 2024-06-04 | DOI: 10.1007/s00530-024-01364-2
Anan Liu, Yanwei Xie, Lanjun Wang, Guoqing Jin, Junbo Guo, Jun Li
Online social networks are easily exploited by social bots. Although current models for detecting social bots show promising results, they mainly rely on Graph Neural Networks (GNNs), which have been shown to have robustness vulnerabilities, and the detection models likely inherit similar weaknesses. It is therefore crucial to evaluate and improve their robustness. This paper proposes a robustness evaluation method, the Attribute Random Iteration-Fast Gradient Sign Method (ARI-FGSM), and uses simplified adversarial training to improve the robustness of social bot detection. Specifically, this study evaluates the robustness of five bot detection models on two datasets under both black-box and white-box scenarios. The white-box experiments achieve a minimum attack success rate of 86.23%, and the black-box experiments 45.86%, showing that social bot detection models are vulnerable to adversarial attacks. Moreover, after applying our robustness improvement method, the robustness of the detection model increased by up to 86.98%.
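A single FGSM step on the account-attribute matrix of a hypothetical GNN detector model(X, A) illustrates the core of the gradient-sign attack; the attribute-random-iteration part of ARI-FGSM is omitted here, and eps is illustrative:

```python
import torch
import torch.nn.functional as F

def fgsm_on_attributes(model, X, A, y, eps=0.05):
    """One gradient-sign step on node attributes X; model(X, A) -> logits
    is an assumed interface, not the paper's actual detector API."""
    X_adv = X.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(X_adv, A), y)
    loss.backward()
    with torch.no_grad():
        X_adv = X_adv + eps * X_adv.grad.sign()   # move along the loss gradient
    return X_adv.detach()
```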
{"title":"Social bot detection on Twitter: robustness evaluation and improvement","authors":"Anan Liu, Yanwei Xie, Lanjun Wang, Guoqing Jin, Junbo Guo, Jun Li","doi":"10.1007/s00530-024-01364-2","DOIUrl":"https://doi.org/10.1007/s00530-024-01364-2","url":null,"abstract":"<p>Online social networks are easily exploited by social bots. Although the current models for detecting social bots show promising results, they mainly rely on Graph Neural Networks (GNNs), which have been proven to have vulnerabilities in robustness and these detection models likely have similar robustness vulnerabilities. Therefore, it is crucial to evaluate and improve their robustness. This paper proposes a robustness evaluation method: Attribute Random Iteration-Fast Gradient Sign Method (ARI-FGSM) and uses a simplified adversarial training to improve the robustness of social bot detection. Specifically, this study performs robustness evaluations of five bot detection models on two datasets under both black-box and white-box scenarios. The white-box experiments achieve a minimum attack success rate of 86.23%, while the black-box experiments achieve a minimum attack success rate of 45.86%. This shows that the social bot detection model is vulnerable to adversarial attacks. Moreover, after executing our robustness improvement method, the robustness of the detection model increased by up to 86.98%.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"127 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A irregular text detection via dilated recombination and efficient reorganization on natural scene
Pub Date: 2024-06-02 | DOI: 10.1007/s00530-024-01360-6
Liwen Huang, Wenyuan Yang
In recent years, scene text detection has gained broader prospects through growing application opportunities. Nevertheless, balancing detection capability against suitable real-time performance remains an essential consideration in irregular text detection. With this trade-off in mind, we propose an efficient scene text detector named DENet, which unites a Dilated Recombined Unit (DRU) and an Efficient Reorganized Unit (ERU). First, input features are extracted by a DR-VanillaNet backbone; the dilated recombined unit is inserted into every block of DR-VanillaNet to strengthen connections between distant pixels. Next, an FPN with the efficient reorganized unit exploits feature redundancy and partially permutes channels. Together, DRU and ERU improve precision with only a limited loss of speed. Moreover, a progressive scale expansion is applied, which preserves the ability to separate adjacent text instances. Experiments on the CTW1500 and Total-Text benchmark datasets show that the designed model improves precision with a limited drop in speed: precision reaches 84.29% and 85.30% on the two datasets, with 8.6 and 10.9 FPS, respectively.
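The abstract gives no internals for the DRU; as a loose sketch of the dilated-recombination idea (parallel dilated convolutions reach distant pixels, a 1x1 convolution recombines them), with the dilation rates assumed:

```python
import torch
import torch.nn as nn

class DilatedRecombination(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates enlarge the
    receptive field; a 1x1 convolution recombines the branches (sketch)."""
    def __init__(self, ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        return x + self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```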
{"title":"A irregular text detection via dilated recombination and efficient reorganization on natural scene","authors":"Liwen Huang, Wenyuan Yang","doi":"10.1007/s00530-024-01360-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01360-6","url":null,"abstract":"<p>In recent years, scene text detection has brought out broader prospects via growing applied opportunities. Nevertheless, pointing out which detected capability and suitable instantaneity in equilibrium is an essential consideration of irregular text detection. Out of consideration for the trouble, we propose an efficient scene text detector that unites a Dilated Recombined Unit (DRU) and a Efficient Reorganized Unit (ERU), named DENet. In the beginning, input feature information is received into a DR-VanillaNet backbone. Dilated recombined unit is devised to insert into every block of DR-VanillaNet to heighten the connection about distant pixel points. Next, an FPN with efficient reorganized unit tends to exploit feature redundancy and permutate channels partially. Correspondingly, DRU and ERU work on constructive effect for precision with a limited descent of speed. Moreover, a progressive scale expansion is carried forward which maintains the ability to generate the adjacent instances successfully. Multiple experiments on CTW1500, Total-Text benchmark datasets prove that designed model intends to improve precision accompanied by a limited drop of speed. It is specifically indicated that the value of precision on these two datasets reaches 84.29% and 85.30%. And FPS are achieved by 8.6 and 10.9, respectively.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"29 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141194673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel spatial and spectral transformer network for hyperspectral image super-resolution
Pub Date: 2024-06-01 | DOI: 10.1007/s00530-024-01363-3
Huapeng Wu, Hui Xu, Tianming Zhan
Recently, transformer networks for hyperspectral image super-resolution have achieved significant performance gains over most convolutional neural networks. However, how to efficiently design a lightweight transformer structure that extracts long-range spatial and spectral information from hyperspectral images remains an open problem. This paper proposes a novel spatial and spectral transformer network (SSTN) for hyperspectral image super-resolution. Specifically, the proposed framework mainly consists of multiple consecutive, alternating global attention layers and regional attention layers. In the global attention layer, a low-complexity spatial and spectral self-attention module is introduced to learn global spatial and spectral interactions, enhancing the representation ability of the network. In addition, the proposed regional attention layer extracts regional feature information using window self-attention based on a zero-padding strategy. This alternating architecture can adaptively learn regional and global feature information of hyperspectral images. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art hyperspectral image super-resolution methods.
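A minimal sketch of the alternating layout (plain nn.MultiheadAttention stands in for the paper's lower-complexity spatial-spectral attention, and the zero-padding window rule is reduced to an even split):

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Global attention over all tokens, then regional attention inside
    fixed windows; dimensions and window size are illustrative."""
    def __init__(self, dim=64, heads=4, window=16):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):                         # x: (B, N, dim), N % window == 0
        g, _ = self.global_attn(x, x, x)
        x = x + g                                 # global interaction
        B, N, D = x.shape
        w = x.reshape(B * (N // self.window), self.window, D)
        loc, _ = self.local_attn(w, w, w)
        return x + loc.reshape(B, N, D)           # regional interaction
```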
{"title":"A novel spatial and spectral transformer network for hyperspectral image super-resolution","authors":"Huapeng Wu, Hui Xu, Tianming Zhan","doi":"10.1007/s00530-024-01363-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01363-3","url":null,"abstract":"<p>Recently, transformer networks based on hyperspectral image super-resolution have achieved significant performance in comparison with most convolution neural networks. However, this is still an open problem of how to efficiently design a lightweight transformer structure to extract long-range spatial and spectral information from hyperspectral images. This paper proposes a novel spatial and spectral transformer network (SSTN) for hyperspectral image super-resolution. Specifically, the proposed transformer framework mainly consists of multiple consecutive alternating global attention layers and regional attention layers. In the global attention layer, a spatial and spectral self-attention module with less complexity is introduced to learn spatial and spectral global interaction, which can enhance the representation ability of the network. In addition, the proposed regional attention layer can extract regional feature information by using a window self-attention based on zero-padding strategy. This alternating architecture can adaptively learn regional and global feature information of hyperspectral images. Extensive experimental results demonstrate that the proposed method can achieve superior performance in comparison with the state-of-the-art hyperspectral image super-resolution methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"29 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141194707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SD-Pose: facilitating space-decoupled human pose estimation via adaptive pose perception guidance
Pub Date: 2024-05-31 | DOI: 10.1007/s00530-024-01368-y
Zhi Liu, Shengzhao Hao, Yunhua Lu, Lei Liu, Cong Chen, Ruohuang Wang
Human pose estimation is a popular and challenging task in computer vision. Currently, the mainstream methods for pose estimation are based on Gaussian heatmaps or coordinate regression. However, the intensive computational overhead and quantization error introduced by heatmaps impose many limitations on their application, while coordinate regression struggles to learn the mapping for crossed and misaligned keypoints, resulting in poor robustness. Recently, pose estimation based on coordinate classification has encoded global spatial information into one-dimensional representations in the X and Y directions, turning keypoint localization into a classification problem; this simplifies the model while effectively improving pose estimation accuracy. Motivated by this, we propose SD-Pose, a spatially decoupled human pose estimation model guided by adaptive pose perception. Specifically, the model first employs a Pyramid Adaptive Feature Extractor (PAFE) to obtain multi-scale feature maps and generate adaptive keypoint weights that help the model extract unique features for keypoints at different locations. Then, the Spatial Decoupling and Coordinated Analysis Module (SDCAM) simplifies the localization problem while considering both global and fine-grained features. Experimental results on the MPII Human Pose and COCO keypoint detection datasets validate the effectiveness of SD-Pose and show satisfactory performance in recovering detailed information for keypoints such as the elbow, hip, and ankle.
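The coordinate-classification formulation the abstract builds on replaces 2-D heatmaps with two 1-D classifications per keypoint; a minimal sketch of such a head (bin counts and the linear mapping are illustrative, not SD-Pose's exact design):

```python
import torch
import torch.nn as nn

class CoordClassificationHead(nn.Module):
    """Localize each keypoint with two 1-D classifications, one over x bins
    and one over y bins, instead of a 2-D heatmap (sketch)."""
    def __init__(self, feat_dim=2048, n_kpts=16, x_bins=384, y_bins=384, d=128):
        super().__init__()
        self.n_kpts, self.d = n_kpts, d
        self.to_kpt = nn.Linear(feat_dim, n_kpts * d)
        self.x_head = nn.Linear(d, x_bins)
        self.y_head = nn.Linear(d, y_bins)

    def forward(self, feat):                      # feat: (B, feat_dim)
        k = self.to_kpt(feat).view(-1, self.n_kpts, self.d)
        return self.x_head(k), self.y_head(k)     # (B, K, x_bins), (B, K, y_bins)

# train with cross-entropy per axis; at inference, argmax per axis gives (x, y)
```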
{"title":"SD-Pose: facilitating space-decoupled human pose estimation via adaptive pose perception guidance","authors":"Zhi Liu, Shengzhao Hao, Yunhua Lu, Lei Liu, Cong Chen, Ruohuang Wang","doi":"10.1007/s00530-024-01368-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01368-y","url":null,"abstract":"<p>Human pose estimation is a popular and challenging task in computer vision. Currently, the mainstream methods for pose estimation are based on Gaussian heatmaps and coordinate regression techniques. However, the intensive computational overhead and quantization error introduced by heatmaps pose many limitations on their application. And coordinate regression faces difficulties in learning mapping cross and misaligned keypoints, resulting in poor robustness. Recently, pose estimation based on Coordinate Classification encodes global spatial information into one-dimensional representations in X and Y directions, which turns keypoint localization into a classification problem and thus simplifies the model while effectively improving pose estimation accuracy. Motivated by this, SD-Pose is proposed in this work, which is a spatially decoupled human pose estimation model guided by adaptive pose perception. Specifically, the model first employs a Pyramid Adaptive Feature Extractor (PAFE) to obtain multi-scale featuremaps and generate adaptive keypoint weights to assist the model in extracting unique features for keypoints at different locations. Then, the Spatial Decoupling and Coordinated Analysis Module (SDCAM) simplifies the localization problem while considering both global and fine-grained features. Experimental results on MPII human pose and COCO keypoint detection datasets validate the effectiveness of the SD-Pose model and also display satisfied performance in recovering detailed information for keypoints such as Elbow, Hip, and Ankle.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"48 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141194514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiscale geometric window transformer for orthodontic teeth point cloud registration
Pub Date: 2024-05-31 | DOI: 10.1007/s00530-024-01369-x
Hao Wang, Yan Tian, Yongchuan Xu, Jiahui Xu, Tao Yang, Yan Lu, Hong Chen
Digital orthodontic treatment monitoring has gained increasing attention in the past decade. However, current methods based on deep learning still face difficult challenges. The transformer, with its excellent ability to model long-range dependencies, can be applied to the task of tooth point cloud registration. Nonetheless, most transformer-based point cloud registration networks suffer from two problems. First, they lack the embedding of credible geometric information, so the learned features are not geometrically discriminative and blur the boundary between inliers and outliers. Second, the attention mechanism lacks continuous downsampling during geometric-transformation-invariant feature extraction at the superpixel level, limiting the field of view and potentially limiting the model's perception of local and global information. In this paper, we propose GeoSwin, which uses a novel geometric window transformer to achieve accurate registration of tooth point clouds across different stages of orthodontic treatment. The method uses point distances, normal vector angles, and bidirectional spatial angular distances as the geometric input embedding of the transformer, and then uses a proposed variable multiscale attention mechanism to achieve geometric information perception from local to global perspectives. Experiments on the Shing3D Dental Dataset demonstrate the effectiveness of our approach and show that it outperforms other state-of-the-art approaches across multiple metrics. Our code and models are available at GeoSwin.
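Two of the geometric inputs named in the abstract, pairwise point distances and pairwise normal angles, can be computed directly as below; the bidirectional spatial angular distance and the learned embedding that consumes these quantities are not reproduced here:

```python
import torch

def pairwise_geometry(p, n):
    """p: (B, N, 3) points; n: (B, N, 3) unit normals. Returns pairwise
    point distances and normal-vector angles as raw geometric inputs."""
    d = torch.cdist(p, p)                                    # (B, N, N) distances
    cos = torch.einsum('bid,bjd->bij', n, n).clamp(-1.0, 1.0)
    return d, torch.acos(cos)                                # distances, angles
```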
{"title":"Multiscale geometric window transformer for orthodontic teeth point cloud registration","authors":"Hao Wang, Yan Tian, Yongchuan Xu, Jiahui Xu, Tao Yang, Yan Lu, Hong Chen","doi":"10.1007/s00530-024-01369-x","DOIUrl":"https://doi.org/10.1007/s00530-024-01369-x","url":null,"abstract":"<p>Digital orthodontic treatment monitoring has been gaining increasing attention in the past decade. However, current methods based on deep learning still face difficult challenges. Transformer, due to its excellent ability to model long-term dependencies, can be applied to the task of tooth point cloud registration. Nonetheless, most transformer-based point cloud registration networks suffer from two problems. First, they lack the embedding of credible geometric information, resulting in learned features that are not geometrically discriminative and blur the boundary between inliers and outliers. Second, the attention mechanism lacks continuous downsampling during geometric transformation invariant feature extraction at the superpixel level, thereby limiting the field of view and potentially limiting the model’s perception of local and global information. In this paper, we propose GeoSwin, which uses a novel geometric window transformer to achieve accurate registration of tooth point clouds in different stages of orthodontic treatment. This method uses the point distance, normal vector angle, and bidirectional spatial angular distances as the input geometric embedding of transformer, and then uses a proposed variable multiscale attention mechanism to achieve geometric information perception from local to global perspectives. Experiments on the Shing3D Dental Dataset demonstrate the effectiveness of our approach and that it outperforms other state-of-the-art approaches across multiple metrics. Our code and models are available at GeoSwin.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"495 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141194712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}