Pub Date: 2024-09-12 DOI: 10.1007/s00371-024-03619-5
Yiting Wu, Pinqi Fang, Xiangning Wang, Jie Shen
Pancreatic cancer (PC) is an extremely deadly cancer, with mortality closely tracking its incidence. At the time of diagnosis, pancreatic cancer is often at an advanced stage and has frequently spread to other parts of the body. Owing to these poor survival outcomes, pancreatic ductal adenocarcinoma (PDAC) is the fifth leading cause of cancer death worldwide. The 5-year relative survival rate of pancreatic cancer is about 6%, the lowest among all cancers. Currently, there are no established guidelines for screening individuals at high risk for pancreatic cancer, including those with a family history of pancreatic disease or with chronic pancreatitis (CP). With the development of medicine, fundus images can now be used to predict many systemic diseases, and associations between ocular changes and several pancreatic diseases have also been reported; the use of AI and fundus images has thus extended beyond the investigation of ocular disorders. Our objective is therefore to construct a deep learning model that identifies correlations between ocular features and major pancreatic diseases. To solve the PC and CP classification tasks, we propose a new deep learning model (PANet) that integrates a pre-trained CNN backbone, multi-scale feature modules, attention mechanisms, and a fully connected (FC) classifier. PANet adopts a ResNet34 backbone and selectively integrates attention modules to construct its fundamental architecture. To enhance feature extraction, PANet places multi-scale feature modules before the attention modules. The model is trained and evaluated on a dataset of 1300 fundus images. The experimental results show that the model achieves an accuracy of 91.50% and an area under the receiver operating characteristic curve (AUC) of 96.00% in PC classification, and an accuracy of 95.60% and an AUC of 99.20% in CP classification. Our study establishes a characterizing link between ocular features and major pancreatic diseases, providing a non-invasive, convenient, and complementary method for screening and detection of pancreatic diseases.
{"title":"Predicting pancreatic diseases from fundus images using deep learning","authors":"Yiting Wu, Pinqi Fang, Xiangning Wang, Jie Shen","doi":"10.1007/s00371-024-03619-5","DOIUrl":"https://doi.org/10.1007/s00371-024-03619-5","url":null,"abstract":"<p>Pancreatic cancer (PC) is an extremely deadly cancer, with mortality rates closely tied to its frequency of occurrence. By the time of diagnosis, pancreatic cancer often presents at an advanced stage, and has often spread to other parts of the body. Due to the poor survival outcomes, PDAC is the fifth leading cause of global cancer death. The 5-year relative survival rate of pancreatic cancer was about 6% and the lowest level in all cancers. Currently, there are no established guidance for screening individuals at high risk for pancreatic cancer, including those with a family history of the pancreatic disease or chronic pancreatitis (CP). With the development of medicine, fundus maps can now predict many systemic diseases. Subsequently, the association between ocular changes and a few pancreatic diseases was also discovered. Therefore, our objective is to construct a deep learning model aimed at identifying correlations between ocular features and significant pancreatic ailments. The utilization of AI and fundus images has extended beyond the investigation of ocular disorders. Hence, in order to solve the tasks of PC and CP classification, we propose a brand new deep learning model (PANet) that integrates pre-trained CNN network, multi-scale feature modules, attention mechanisms, and an FC classifier. PANet adopts a ResNet34 backbone and selectively integrates attention modules to construct its fundamental architecture. To enhance feature extraction capability, PANet combines multi-scale feature modules before the attention module. Our model is trained and evaluated using a dataset comprising 1300 fundus images. The experimental outcomes illustrate the successful realization of our objectives, with the model achieving an accuracy of 91.50% and an area under the receiver operating characteristic curve (AUC) of 96.00% in PC classification, and an accuracy of 95.60% and an AUC of 99.20% in CP classification. Our study establishes a characterizing link between ocular features and major pancreatic diseases, providing a non-invasive, convenient, and complementary method for screening and detection of pancreatic diseases.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 DOI: 10.1007/s00371-024-03614-w
Liangrui Wei, Feifei Xie, Lin Sun, Jinpeng Chen, Zhipeng Zhang
6D pose estimation from RGB-D data has significant application value in computer vision and related fields. Current deep learning methods commonly employ convolutional networks for feature extraction, which are sensitive to keypoints at close distances but overlook information related to keypoints at longer distances. Moreover, in subsequent stages, spatial features (depth channel features) and color-texture features (RGB channel features) are not effectively fused. Consequently, the accuracy of existing 6D pose networks based on RGB-D data is compromised. To solve this problem, a novel end-to-end 6D pose estimation network is proposed. In the depth-processing branch, global spatial weights are established using a mask-vector attention mechanism to robustly extract depth features. In the feature fusion phase, a symmetric fusion module is introduced, in which spatial features and color-texture features are fused through a cross-attention mechanism. Experimental evaluations were performed on the LINEMOD and LINEMOD-OCCLUSION datasets, where the ADD(-S) scores of our method reach 95.84% and 47.89%, respectively. Compared to state-of-the-art methods, our method demonstrates superior pose estimation performance for objects with complex shapes. Moreover, in the presence of occlusion, the pose estimation accuracy for asymmetric objects is effectively improved.
{"title":"A modal fusion network with dual attention mechanism for 6D pose estimation","authors":"Liangrui Wei, Feifei Xie, Lin Sun, Jinpeng Chen, Zhipeng Zhang","doi":"10.1007/s00371-024-03614-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03614-w","url":null,"abstract":"<p>The 6D pose estimation based on RGB-D data holds significant application value in computer vision and related fields. Currently, deep learning methods commonly employ convolutional networks for feature extraction, which are sensitive to keypoints at close distances but overlook information related to keypoints at longer distances. Moreover, in subsequent stages, there is a failure to effectively fuse spatial features (depth channel features) and color texture features (RGB channel features). Consequently, this limitation results in compromised accuracy in existing 6D pose networks based on RGB-D data. To solve this problem, a novel end-to-end 6D pose estimation network is proposed. In the branch of depth data processing network, the global spatial weight is established by using the attention mechanism of mask vector to realize robust extraction of depth features. In the phase of feature fusion, a symmetric fusion module is introduced. In this module, spatial features and color texture features are self-related fused by means of cross-attention mechanism. Experimental evaluations were performed on the LINEMOD and LINEMOD-OCLUSION datasets, and the ADD(-S) scores of our method can reach 95.84% and 47.89%, respectively. Compared to state-of-the-art methods, our method demonstrates superior performance in pose estimation for objects with complex shapes. Moreover, in the presence of occlusion, the pose estimation accuracy of our method for asymmetric objects has been effectively improved.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 DOI: 10.1007/s00371-024-03612-y
Yazhuo Fan, Jianhua Song, Lei Yuan, Yunlin Jia
In recent years, to integrate the individual strengths of convolutional neural networks (CNN) and Transformer, hybrid network structures have been built that combine the two approaches in medical image segmentation. However, most methods integrate CNN and Transformer at only a single level and cannot extract low-level detail features and high-level abstract information simultaneously. Such structures also lack flexibility and cannot dynamically adjust the contributions of different feature maps. To address these limitations, we introduce HCT-Unet, a hybrid CNN-Transformer model specifically designed for multi-organ medical image segmentation. HCT-Unet introduces a tunable hybrid paradigm that differs significantly from conventional hybrid architectures. At each stage, it deploys a powerful CNN to capture short-range information and a Transformer to extract long-range information. Furthermore, we design a multi-functional multi-scale fusion bridge, which progressively integrates information from different scales and dynamically modifies attention weights for both local and global features. With these two innovative designs, HCT-Unet demonstrates robust dependency-modeling and representation capabilities in multi-target medical image tasks. Experimental results reveal the remarkable performance of our approach in medical image segmentation. Specifically, in multi-organ segmentation, HCT-Unet achieves a Dice similarity coefficient (DSC) of 82.23%; in cardiac segmentation, it reaches a DSC of 91%, significantly outperforming previous state-of-the-art networks. The code has been released on Zenodo: https://zenodo.org/doi/10.5281/zenodo.11070837.
{"title":"HCT-Unet: multi-target medical image segmentation via a hybrid CNN-transformer Unet incorporating multi-axis gated multi-layer perceptron","authors":"Yazhuo Fan, Jianhua Song, Lei Yuan, Yunlin Jia","doi":"10.1007/s00371-024-03612-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03612-y","url":null,"abstract":"<p>In recent years, for the purpose of integrating the individual strengths of convolutional neural networks (CNN) and Transformer, a network structure has been built to integrate the two methods in medical image segmentation. But most of the methods only integrate CNN and Transformer at a single level and cannot extract low-level detail features and high-level abstract information simultaneously. Meanwhile, this structure lacks flexibility, unable to dynamically adjust the contributions of different feature maps. To address these limitations, we introduce HCT-Unet, a hybrid CNN-Transformer model specifically designed for multi-organ medical images segmentation. HCT-Unet introduces a tunable hybrid paradigm that differs significantly from conventional hybrid architectures. It deploys powerful CNN to capture short-range information and Transformer to extract long-range information at each stage. Furthermore, we have designed a multi-functional multi-scale fusion bridge, which progressively integrates information from different scales and dynamically modifies attention weights for both local and global features. With the benefits of these two innovative designs, HCT-Unet demonstrates robust discriminative dependency and representation capabilities in multi-target medical image tasks. Experimental results reveal the remarkable performance of our approach in medical image segmentation tasks. Specifically, in multi-organ segmentation tasks, HCT-Unet achieved a Dice similarity coefficient (DSC) of 82.23%. Furthermore, in cardiac segmentation tasks, it reached a DSC of 91%, significantly outperforming previous state-of-the-art networks. The code has been released on Zenodo: https://zenodo.org/doi/10.5281/zenodo.11070837.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 DOI: 10.1007/s00371-024-03605-x
Zhiyuan Li, Xin Jin, Qian Jiang, Puming Wang, Shin-Jye Lee, Shaowen Yao, Wei Zhou
The malicious abuse of deepfakes has raised serious ethical, security, and privacy concerns, eroding public trust in digital media. While existing deepfake detectors can detect fake images, they are vulnerable to adversarial attacks. Although various adversarial attacks have been explored, most are white-box attacks that are difficult to realize in practice, and the adversarial examples they generate are of poor quality and easily noticeable to the human eye. For this detection task, the goal should be to generate adversarial examples that deceive detectors while maintaining high quality and authenticity. We propose a method to generate imperceptible and transferable adversarial examples aimed at fooling unknown deepfake detectors. The method combines a conditional residual generator with an accessible detector as a surrogate model, utilizing the detector's relative distance loss function to generate highly transferable adversarial examples. A discrete wavelet transform is also introduced to enhance image quality. Extensive experiments demonstrate that the adversarial examples generated by our method not only possess excellent visual quality but also effectively deceive various detectors, exhibiting superior cross-detector transferability in black-box attacks. Our code is available at: https://github.com/SiSuiyuHang/ITA.
Title: Crafting imperceptible and transferable adversarial examples: leveraging conditional residual generator and wavelet transforms to deceive deepfake detection
Pub Date: 2024-09-03 DOI: 10.1007/s00371-024-03610-0
Weifeng Cao, Xiaoyan Lei, Jun Shi, Wanyong Liang, Jie Liu, Zongfei Bai
Recently, driven by limited hardware resources, lightweight methods for single-image super-resolution have gained significant popularity and achieved impressive performance. These methods demonstrate that adopting residual feature distillation is an effective way to enhance performance. However, we find that using residual connections after each block increases the model's storage and computational cost. Therefore, to simplify the network structure and learn higher-level features and relationships between features, we use depth-wise separable convolutions (in place of standard convolutions), fully connected layers, and activation functions as the basic feature extraction modules, which significantly reduces computational load and the number of parameters while maintaining strong feature extraction capability. To further enhance performance, we propose the hybrid attention separable block, which combines channel attention and spatial attention and thus exploits their complementary advantages. During the training phase, we also adopt a warm-start retraining strategy to further exploit the potential of the model. Extensive experiments demonstrate the effectiveness of our approach: it achieves a smaller model size and lower computational complexity without compromising performance. Code is available at https://github.com/nathan66666/HASN.git.
{"title":"HASN: hybrid attention separable network for efficient image super-resolution","authors":"Weifeng Cao, Xiaoyan Lei, Jun Shi, Wanyong Liang, Jie Liu, Zongfei Bai","doi":"10.1007/s00371-024-03610-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03610-0","url":null,"abstract":"<p>Recently, lightweight methods for single-image super-resolution have gained significant popularity and achieved impressive performance due to limited hardware resources. These methods demonstrate that adopting residual feature distillation is an effective way to enhance performance. However, we find that using residual connections after each block increases the model’s storage and computational cost. Therefore, to simplify the network structure and learn higher-level features and relationships between features, we use depth-wise separable convolutions, fully connected layers, and activation functions as the basic feature extraction modules. This significantly reduces computational load and the number of parameters while maintaining strong feature extraction capabilities. To further enhance model performance, we propose the hybrid attention separable block, which combines channel attention and spatial attention, thus making use of their complementary advantages. Additionally, we use depth-wise separable convolutions instead of standard convolutions, significantly reducing the computational load and the number of parameters while maintaining strong feature extraction capabilities. During the training phase, we also adopt a warm-start retraining strategy to exploit the potential of the model further. Extensive experiments demonstrate the effectiveness of our approach. Our method achieves a smaller model size and reduced computational complexity without compromising performance. Code can be available at https://github.com/nathan66666/HASN.git</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30 DOI: 10.1007/s00371-024-03601-1
Haojie Gao, Peishun Liu, Xiaolong Ma, Zikang Yan, Ningning Ma, Wenqiang Liu, Xuefang Wang, Ruichun Tang
Dense multi-label action detection is a challenging task in the field of visual action understanding, in which multiple actions occur simultaneously over different time spans; accurately assessing the short-term and long-term temporal dependencies between actions is therefore crucial. An effective temporal modeling technique is needed to detect the temporal dependence of actions in videos and efficiently learn long-term and short-term action information. This paper proposes a new method based on a temporal pyramid and long short-term time modeling for multi-label action detection, which combines a hierarchical structure with a pyramid feature hierarchy for dense multi-label temporal action detection. By using the expansion and compression convolution module (SEC) and external attention for time modeling, we focus on the temporal relationships of long- and short-term actions at each stage. We then integrate hierarchical pyramid features to achieve accurate detection of actions at different temporal resolution scales. We evaluated the model on dense multi-label benchmark datasets and achieved mAP of 47.3% and 36.0% on the MultiTHUMOS and TSU datasets, outperforming the current state-of-the-art results by 2.7% and 2.3%, respectively. The code is available at https://github.com/Yoona6371/TP-LSM.
Title: TP-LSM: visual temporal pyramidal time modeling network to multi-label action detection in image-based AI
Pub Date: 2024-08-29 DOI: 10.1007/s00371-024-03598-7
Yaqi Sun, Xiaolan Xie, Zhi Li, Kai Yang
Recognizing low-resolution text images is challenging because fine details are often lost, leading to poor recognition accuracy. Moreover, traditional methods based on deep convolutional neural networks (CNNs) are not effective enough for low-resolution text images with dense characters. In this paper, a novel CNN-based batch-transformer network for scene text image super-resolution (BT-STISR) is proposed to address this problem. To obtain text information for reconstruction, a pre-trained text prior module is employed. A novel two-pipeline batch-transformer-based module is then proposed, leveraging self-attention and global attention mechanisms so that the text prior guides the reconstruction process. An experimental study on the benchmark dataset TextZoom shows that the proposed BT-STISR method achieves state-of-the-art performance in terms of structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) compared with recent methods.
{"title":"Batch-transformer for scene text image super-resolution","authors":"Yaqi Sun, Xiaolan Xie, Zhi Li, Kai Yang","doi":"10.1007/s00371-024-03598-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03598-7","url":null,"abstract":"<p>Recognizing low-resolution text images is challenging as they often lose their detailed information, leading to poor recognition accuracy. Moreover, the traditional methods, based on deep convolutional neural networks (CNNs), are not effective enough for some low-resolution text images with dense characters. In this paper, a novel CNN-based batch-transformer network for scene text image super-resolution (BT-STISR) method is proposed to address this problem. In order to obtain the text information for text reconstruction, a pre-trained text prior module is employed to extract text information. Then a novel two pipeline batch-transformer-based module is proposed, leveraging self-attention and global attention mechanisms to exert the guidance of text prior to the text reconstruction process. Experimental study on a benchmark dataset TextZoom shows that the proposed method BT-STISR achieves the best state-of-the-art performance in terms of structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) metrics compared to some latest methods.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 DOI: 10.1007/s00371-024-03602-0
Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang
The express parcel (EP) detection model needs to be deployed on edge devices with limited computing capability, so a lightweight and efficient object detection model is essential. In this work, we introduce CMViT, a novel lightweight CNN-ViT with a cross-module representational constraint designed specifically for EP detection. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention (CMA) encoder. Local features are provided by the proposed lightweight shuffle block (LSBlock), and the CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features and thereby effectively enlarging the model's receptive field. Furthermore, LSBlock provides effective guidance and constraints for the CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with 25% of the input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT also performs strongly, achieving 28.8 mAP and 2.2G MAdds on the COCO dataset, outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.
{"title":"Lightweight CNN-ViT with cross-module representational constraint for express parcel detection","authors":"Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang","doi":"10.1007/s00371-024-03602-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03602-0","url":null,"abstract":"<p>The express parcel(EP) detection model needs to be deployed on edge devices with limited computing capabilities, hence a lightweight and efficient object detection model is essential. In this work, we introduce a novel lightweight CNN-ViT with cross-module representational constraint designed specifically for EP detection—CMViT. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention(CMA) encoder. Local features are provided by the proposed lightweight shuffle block(LSBlock), and CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features, thereby effectively enhancing the model’s receptive field. Furthermore, LSBlock provides effective guidance and constraints for CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with a 25% input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT exhibits exceptional performance, achieving 28.8 mAP and 2.2G MAdds on COCO dataset, thus outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-27 DOI: 10.1007/s00371-024-03419-x
Mei Zhang, Lingling Liu, Yongtao Pei, Guojing Xie, Jinghua Wen
Remote sensing images exhibit complex characteristics such as irregular multi-scale feature shapes, significant scale variations, and imbalanced sizes between categories. These characteristics reduce the accuracy of semantic segmentation in remote sensing images. To address this problem, this paper presents a context feature-enhanced multi-scale remote sensing image semantic segmentation method. It utilizes a context aggregation module for global context co-aggregation, obtaining feature representations at different levels through self-similarity calculation and convolution operations. The processed features are input into a feature enhancement module, which introduces a channel gate mechanism that strengthens the expressive power of the feature maps by leveraging channel correlations and weighted fusion operations. Additionally, pyramid pooling is employed to capture multi-scale information from the enhanced features, improving the performance and accuracy of the semantic segmentation model. Experimental results on the Vaihingen and Potsdam datasets (publicly available at https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx) demonstrate significant improvements in performance and accuracy over previous multi-scale remote sensing image semantic segmentation approaches, verifying the effectiveness of the proposed method (whose source code is publicly released, as noted in Sect. 3.4).
{"title":"Semantic segmentation of multi-scale remote sensing images with contextual feature enhancement","authors":"Mei Zhang, Lingling Liu, Yongtao Pei, Guojing Xie, Jinghua Wen","doi":"10.1007/s00371-024-03419-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03419-x","url":null,"abstract":"<p>Remote sensing images exhibit complex characteristics such as irregular multi-scale feature shapes, significant scale variations, and imbalanced sizes between different categories. These characteristics lead to a decrease in the accuracy of semantic segmentation in remote sensing images. In view of this problem, this paper presents a context feature-enhanced multi-scale remote sensing image semantic segmentation method. It utilizes a context aggregation module for global context co-aggregation, obtaining feature representations at different levels through self-similarity calculation and convolution operations. The processed features are input into a feature enhancement module, introducing a channel gate mechanism to enhance the expressive power of feature maps. This mechanism enhances feature representations by leveraging channel correlations and weighted fusion operations. Additionally, pyramid pooling is employed to capture multi-scale information from the enhanced features, so as to improve the performance and accuracy of the semantic segmentation model. Experimental results on the Vaihingen and Potsdam datasets (which are indeed publicly released at the URL: https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx) demonstrate significant improvements in the performance and accuracy of the proposed method (whose algorithm source code is indeed publicly released in Sect. 3.4), compared to previous multi-scale remote sensing image semantic segmentation approaches, verifying its effectiveness.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-26 DOI: 10.1007/s00371-024-03599-6
Hu Gao, Jing Yang, Ying Zhang, Ning Wang, Jingfan Yang, Depeng Dang
Image restoration aims to recover a high-quality image from a corrupted input, as in deblurring and deraining. It typically requires maintaining a complex balance between spatial details and contextual information. Although a multi-stage network can balance these competing goals and achieve significant performance, it also increases the system's complexity. In this paper, we propose a mountain-shaped single-stage design that matches the performance of multi-stage networks through a plug-and-play feature fusion middleware. Specifically, we propose a feature fusion middleware mechanism as an information exchange component between the encoder-decoder architectural levels. It seamlessly integrates upper-layer information into the adjacent lower layer, sequentially down to the lowest layer, and finally fuses all information at the original image-resolution manipulation level. This preserves spatial details while integrating contextual information, ensuring high-quality image restoration. Simultaneously, we propose a multi-head attention middle block as a bridge between the encoder and decoder to capture more global information and surpass the limitations of the receptive field of CNNs. To keep system complexity low, we remove or replace unnecessary nonlinear activation functions. Extensive experiments demonstrate that our approach, named M3SNet, outperforms previous state-of-the-art models on several image restoration tasks, such as image deraining and deblurring, while using less than half the computational cost. The code and pre-trained models will be released at https://github.com/Tombs98/M3SNet.
{"title":"A novel single-stage network for accurate image restoration","authors":"Hu Gao, Jing Yang, Ying Zhang, Ning Wang, Jingfan Yang, Depeng Dang","doi":"10.1007/s00371-024-03599-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03599-6","url":null,"abstract":"<p>Image restoration is the task of aiming to obtain a high-quality image from a corrupt input image, such as deblurring and deraining. In image restoration, it is typically necessary to maintain a complex balance between spatial details and contextual information. Although a multi-stage network can optimally balance these competing goals and achieve significant performance, this also increases the system’s complexity. In this paper, we propose a mountain-shaped single-stage design, which achieves the performance of multi-stage networks through a plug-and-play feature fusion middleware. Specifically, we propose a plug-and-play feature fusion middleware mechanism as an information exchange component between the encoder-decoder architectural levels. It seamlessly integrates upper-layer information into the adjacent lower layer, sequentially down to the lowest layer. Finally, all information is fused into the original image resolution manipulation level. This preserves spatial details and integrates contextual information, ensuring high-quality image restoration. Simultaneously, we propose a multi-head attention middle block as a bridge between the encoder and decoder to capture more global information and surpass the limitations of the receptive field of CNNs. In order to achieve low system complexity, we removes or replaces unnecessary nonlinear activation functions. Extensive experiments demonstrate that our approach, named as M3SNet, outperforms previous state-of-the-art models while using less than half the computational costs, for several image restoration tasks, such as image deraining and deblurring. The code and the pre-trained models will be released at https://github.com/Tombs98/M3SNet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}