Lung cancer remains one of the leading causes of cancer-related deaths worldwide, underlining the urgent need for accurate and early detection and classification methods. In this paper, we present a comprehensive study that evaluates and compares different deep learning techniques for accurately distinguishing between nodules and non-nodules in 2D CT images. Our work introduces an innovative deep learning strategy called “Max-Min CNN” to improve lung nodule classification. Three models were developed based on the Max-Min strategy: (1) a Max-Min CNN model built and trained from scratch, (2) a Bilinear Max-Min CNN composed of two Max-Min CNN streams whose outputs are bilinearly pooled by a Kronecker product, and (3) a hybrid Max-Min ViT combining a ViT model built from scratch with the proposed Max-Min CNN architecture as a backbone. To ensure an objective analysis of our findings, we evaluated each proposed model on 3186 images from the public LUNA16 database. Experimental results showed that the proposed hybrid Max-Min ViT outperformed the Bilinear Max-Min CNN and the Max-Min CNN, with an accuracy of 98.03% versus 96.89% and 95.82%, respectively. This study clearly demonstrates the contribution of the Max-Min strategy to improving the effectiveness of deep learning models for pulmonary nodule classification on CT images.
{"title":"Performance analysis of various deep learning models based on Max-Min CNN for lung nodule classification on CT images","authors":"Rekka Mastouri, Nawres Khlifa, Henda Neji, Saoussen Hantous-Zannad","doi":"10.1007/s00138-024-01569-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01569-5","url":null,"abstract":"<p>Lung cancer remains one of the leading causes of cancer-related deaths worldwide, underlining the urgent need for accurate and early detection and classification methods. In this paper, we present a comprehensive study that evaluates and compares different deep learning techniques for accurately distinguishing between nodule and non-nodule in 2D CT images. Our work introduced an innovative deep learning strategy called “Max-Min CNN” to improve lung nodule classification. Three models have been developed based on the Max-Min strategy: (1) a Max-Min CNN model built and trained from scratch, (2) a Bilinear Max-Min CNN composed of two Max-Min CNN streams whose outputs were bilinearly pooled by a Kronecker product, and (3) a hybrid Max-Min ViT combining a ViT model built from scratch and the proposed Max-Min CNN architecture as a backbone. To ensure an objective analysis of our findings, we evaluated each proposed model on 3186 images from the public LUNA16 database. Experimental results demonstrated the outperformance of the proposed hybrid Max-Min ViT over the Bilinear Max-Min CNN and the Max-Min CNN, with an accuracy rate of 98.03% versus 96.89% and 95.82%, respectively. This study clearly demonstrated the contribution of the Max-Min strategy in improving the effectiveness of deep learning models for pulmonary nodule classification on CT images.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"189 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-20  DOI: 10.1007/s00138-024-01562-y
Pablo Malvido Fresnillo, Wael M. Mohammed, Saigopal Vasudevan, Jose A. Perez Garcia, Jose L. Martinez Lastra
Semantic segmentation is one of the most important and most studied problems in machine vision, and many deep learning models solve it with high accuracy. However, all of these models share a significant drawback: they require large and diverse datasets for training. Gathering and annotating all of these images manually would be extremely time-consuming, so numerous researchers have proposed approaches to facilitate or automate the process. Nevertheless, when the objects to be segmented are deformable, such as cables, automating this process becomes more challenging, as the dataset needs to represent their high diversity of shapes while keeping a high level of realism, and none of the existing solutions has addressed this effectively. Therefore, this paper proposes a novel methodology to automatically generate highly realistic synthetic datasets of cables for training deep learning models on image segmentation tasks. The methodology uses Blender to create photo-realistic cable scenes and a Python pipeline to introduce random variations and natural deformations. To prove its performance, a dataset composed of 25,000 synthetic cable images and their corresponding masks was generated and used to train six popular deep learning segmentation models. These models were then used to segment real cable images, achieving outstanding results (over 70% IoU and 80% Dice coefficient for all models). Both the methodology and the generated dataset are publicly available in the project's repository.
{"title":"Generation of realistic synthetic cable images to train deep learning segmentation models","authors":"Pablo MalvidoFresnillo, Wael M. Mohammed, Saigopal Vasudevan, Jose A. PerezGarcia, Jose L. MartinezLastra","doi":"10.1007/s00138-024-01562-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01562-y","url":null,"abstract":"<p>Semantic segmentation is one of the most important and studied problems in machine vision, which has been solved with high accuracy by many deep learning models. However, all these models present a significant drawback, they require large and diverse datasets to be trained. Gathering and annotating all these images manually would be extremely time-consuming, hence, numerous researchers have proposed approaches to facilitate or automate the process. Nevertheless, when the objects to be segmented are deformable, such as cables, the automation of this process becomes more challenging, as the dataset needs to represent their high diversity of shapes while keeping a high level of realism, and none of the existing solutions have been able to address it effectively. Therefore, this paper proposes a novel methodology to automatically generate highly realistic synthetic datasets of cables for training deep learning models in image segmentation tasks. This methodology utilizes Blender to create photo-realistic cable scenes and a Python pipeline to introduce random variations and natural deformations. To prove its performance, a dataset composed of 25000 synthetic cable images and their corresponding masks was generated and used to train six popular deep learning segmentation models. These models were then utilized to segment real cable images achieving outstanding results (over 70% IoU and 80% Dice coefficient for all the models). Both the methodology and the generated dataset are publicly available in the project’s repository.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"43 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The detection of video anomalies is a well-known problem in visual research. Because the volumes of normal and abnormal samples in this field are unbalanced, unsupervised training is generally used. Since the advent of deep learning, the field of video anomaly detection has evolved from reconstruction-based detection methods to prediction-based methods, and then to hybrid methods. To identify the presence of anomalies, these methods exploit the differences between ground-truth frames and reconstructed or predicted frames, so the quality of the generated frames directly affects the evaluation of the results. We present a novel hybrid detection method built around a Dual Contrast Discriminator for Video Sequences (DCDVS) and a corresponding loss function. With fewer false positives and higher accuracy, this method improves the discriminator's guidance of the reconstruction-prediction network's generation performance. We integrate optical flow processing and attention mechanisms into the auto-encoder (AE) reconstruction network; this integration improves the network's sensitivity to motion information and its ability to concentrate on important regions. Additionally, introducing an attention module implemented through parameter sharing improves DCDVS's capacity to recognize significant features. To reduce the risk of network overfitting, we also introduce reverse augmentation, a data augmentation technique designed specifically for temporal data. Our approach achieved outstanding performance, with AUC scores of 99.4%, 92.9%, and 77.3% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively, demonstrating competitiveness with advanced methods and validating its effectiveness.
{"title":"Dual contrast discriminator with sharing attention for video anomaly detection","authors":"Yiwenhao Zeng, Yihua Chen, Songsen Yu, Mingzhang Yang, Rongrong Chen, Fang Xu","doi":"10.1007/s00138-024-01566-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01566-8","url":null,"abstract":"<p>The detection of video anomalies is a well-known issue in the realm of visual research. The volume of normal and abnormal sample data in this field is unbalanced, hence unsupervised training is generally used in research. Since the development of deep learning, the field of video anomaly has developed from reconstruction-based detection methods to prediction-based detection methods, and then to hybrid detection methods. To identify the presence of anomalies, these methods take advantage of the differences between ground-truth frames and reconstruction or prediction frames. Thus, the evaluation of the results is directly impacted by the quality of the generated frames. Built around the Dual Contrast Discriminator for Video Sequences (DCDVS) and the corresponding loss function, we present a novel hybrid detection method for further explanation. With less false positives and more accuracy, this method improves the discriminator’s guidance on the reconstruction-prediction network’s generation performance. we integrate optical flow processing and attention processes into the Auto-encoder (AE) reconstruction network. The network’s sensitivity to motion information and its ability to concentrate on important areas are improved by this integration. Additionally, DCDVS’s capacity to successfully recognize significant features gets improved by introducing the attention module implemented through parameter sharing. Aiming to reduce the risk of network overfitting, we also invented reverse augmentation, a data augmentation technique designed specifically for temporal data. Our approach achieved outstanding performance with AUC scores of 99.4, 92.9, and 77.3<span>(%)</span> on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively, demonstrates competitiveness with advanced methods and validates its effectiveness.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"230 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-18  DOI: 10.1007/s00138-024-01563-x
Xuegang Hu, Wei Zhao
Image denoising is crucial for enhancing image quality, improving visual effects, and boosting the accuracy of image analysis and recognition. Most current image denoising methods perform well on synthetic noisy images, but their performance is limited on real-world noisy images, since the types and distributions of real noise are often uncertain. To address this challenge, a multi-scale information fusion generative adversarial network method is proposed in this paper. In this method, the generator is an end-to-end denoising network that consists of a novel encoder–decoder network branch and an improved residual network branch. The encoder–decoder branch extracts rich detailed and contextual information from images at different scales and uses a feature fusion method to aggregate multi-scale information, enhancing the feature representation capability of the network. The residual branch further compensates for the information compressed and lost in the encoder stage. Additionally, to effectively aid the generator in the denoising task, convolution kernels of various sizes are added to the discriminator to improve its image evaluation ability. Furthermore, a dual denoising loss function is presented to enhance the model's capability in noise removal and image restoration. Experimental results show that the proposed method achieves superior objective performance and visual quality compared with some state-of-the-art methods on three real-world datasets.
{"title":"Multi-scale information fusion generative adversarial network for real-world noisy image denoising","authors":"Xuegang Hu, Wei Zhao","doi":"10.1007/s00138-024-01563-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01563-x","url":null,"abstract":"<p>Image denoising is crucial for enhancing image quality, improving visual effects, and boosting the accuracy of image analysis and recognition. Most of the current image denoising methods perform superior on synthetic noise images, but their performance is limited on real-world noisy images since the types and distributions of real noise are often uncertain. To address this challenge, a multi-scale information fusion generative adversarial network method is proposed in this paper. Specifically, In this method, the generator is an end-to-end denoising network that consists of a novel encoder–decoder network branch and an improved residual network branch. The encoder–decoder branch extracts rich detailed and contextual information from images at different scales and utilizes a feature fusion method to aggregate multi-scale information, enhancing the feature representation performance of the network. The residual network further compensates for the compressed and lost information in the encoder stage. Additionally, to effectively aid the generator in accomplishing the denoising task, convolution kernels of various sizes are added to the discriminator to improve its image evaluation ability. Furthermore, the dual denoising loss function is presented to enhance the model’s capability in performing noise removal and image restoration. Experimental results show that the proposed method exhibits superior objective performance and visual quality than some state-of-the-art methods on three real-world datasets.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-02  DOI: 10.1007/s00138-024-01556-w
Xu Han, Haozhe Cheng, Pengcheng Shi, Jihua Zhu
The cross-modal setting that employs 2D images and 3D point clouds in self-supervised representation learning has proven to be an effective way to enhance visual perception capabilities. However, different modalities have different data formats and representations, and directly using features extracted from cross-modal datasets may lead to information conflict and collapse. We refer to this problem as uncertainty in network learning. Reducing uncertainty to obtain trusted descriptions therefore becomes the key to improving network performance. Motivated by this, we propose a trusted cross-modal network for self-supervised learning (TCMSS). It obtains trusted descriptions through a trusted combination module and improves network performance with a well-designed loss function. In the trusted combination module, we use the Dirichlet distribution and subjective logic to parameterize the features and, at the same time, acquire probabilistic uncertainty. The Dempster-Shafer Theory (DST) is then used to obtain trusted descriptions by weighting the parameterized results with this uncertainty. We have also designed a trusted domain loss function, including a domain loss and a trusted loss, which effectively improves the prediction accuracy of the network by applying contrastive learning between different feature descriptions. The experimental results show that our model outperforms previous results on linear classification on ScanObjectNN as well as few-shot classification on both ModelNet40 and ScanObjectNN. In addition, part segmentation reports a result superior to previous methods on ShapeNet. Further, the ablation studies validate the potency of our method for better point cloud understanding.
{"title":"Trusted 3D self-supervised representation learning with cross-modal settings","authors":"Xu Han, Haozhe Cheng, Pengcheng Shi, Jihua Zhu","doi":"10.1007/s00138-024-01556-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01556-w","url":null,"abstract":"<p>Cross-modal setting employing 2D images and 3D point clouds in self-supervised representation learning is proven to be an effective way to enhance visual perception capabilities. However, different modalities have different data formats and representations. Directly using features extracted from cross-modal datasets may lead to information conflicting and collapsing. We refer to this problem as uncertainty in network learning. Therefore, reducing uncertainty to obtain trusted descriptions has become the key to improving network performance. Motivated by this, we propose our trusted cross-modal network in self-supervised learning (TCMSS). It can obtain trusted descriptions by a trusted combination module as well as improve network performance with a well-designed loss function. In the trusted combination module, we utilize the Dirichlet distribution and the subjective logic to parameterize the features and acquire probabilistic uncertainty at the same. Then, the Dempster-Shafer Theory (DST) is used to obtain trusted descriptions by weighting uncertainty to the parameterized results. We have also designed our trusted domain loss function, including domain loss and trusted loss. It can effectively improve the prediction accuracy of the network by applying contrastive learning between different feature descriptions. The experimental results show that our model outperforms previous results on linear classification in ScanObjectNN as well as few-shot classification in both ModelNet40 and ScanObjectNN. In addition, part segmentation also reports a superior result to previous methods in ShapeNet. Further, the ablation studies validate the potency of our method for a better point cloud understanding.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"32 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31  DOI: 10.1007/s00138-024-01557-9
Wuttichai Vijitkunsawat, Teeradaj Racharak, Minh Le Nguyen
Video-based sign language recognition is vital for improving communication for the deaf and hard of hearing. Creating and maintaining high-quality Thai sign language video datasets is challenging due to a lack of resources. Tackling this issue, we rigorously investigate the design and development of a deep learning-based system for Thai Finger Spelling recognition, assessing various models on a new dataset of 90 standard letters performed by 43 diverse signers. We investigate seven deep learning models across three distinct modalities: video-only methods (including RGB-sequencing-based CNN-LSTM and VGG-LSTM), human body joint coordinate sequences (processed by LSTM, BiLSTM, GRU, and Transformer models), and skeleton analysis (using TGCN with a graph-structured skeleton representation). A thorough assessment of these models is conducted across seven circumstances, encompassing single-hand postures; single-hand motions with one, two, and three strokes; and two-hand postures with both static and dynamic point-on-hand interactions. The research highlights that the TGCN model is the optimal lightweight model in all scenarios. In single-hand pose cases, a two-modality combination of the Transformer and TGCN models delivers outstanding performance, excelling in four particular conditions: single-hand poses and single-hand poses requiring one, two, and three strokes. In contrast, two-hand poses with static or dynamic point-on-hand interactions present substantial challenges, as the data from joint coordinates are inadequate due to hand obstructions, stemming from insufficient coordinate sequence data and the lack of a detailed skeletal graph structure. The study recommends integrating RGB-sequencing with the visual modality to enhance the accuracy of two-handed sign language gestures.
{"title":"Deep multimodal-based finger spelling recognition for Thai sign language: a new benchmark and model composition","authors":"Wuttichai Vijitkunsawat, Teeradaj Racharak, Minh Le Nguyen","doi":"10.1007/s00138-024-01557-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01557-9","url":null,"abstract":"<p>Video-based sign language recognition is vital for improving communication for the deaf and hard of hearing. Creating and maintaining quality of Thai sign language video datasets is challenging due to a lack of resources. Tackling this issue, we rigorously investigate a design and development of deep learning-based system for Thai Finger Spelling recognition, assessing various models with a new dataset of 90 standard letters performed by 43 diverse signers. We investigate seven deep learning models with three distinct modalities for our analysis: video-only methods (including RGB-sequencing-based CNN-LSTM and VGG-LSTM), human body joint coordinate sequences (processed by LSTM, BiLSTM, GRU, and Transformer models), and skeleton analysis (using TGCN with graph-structured skeleton representation). A thorough assessment of these models is conducted across seven circumstances, encompassing single-hand postures, single-hand motions with one, two, and three strokes, as well as two-hand postures with both static and dynamic point-on-hand interactions. The research highlights that the TGCN model is the optimal lightweight model in all scenarios. In single-hand pose cases, a combination of the Transformer and TGCN models of two modalities delivers outstanding performance, excelling in four particular conditions: single-hand poses, single-hand poses requiring one, two, and three strokes. In contrast, two-hand poses with static or dynamic point-on-hand interactions present substantial challenges, as the data from joint coordinates is inadequate due to hand obstructions, stemming from insufficient coordinate sequence data and the lack of a detailed skeletal graph structure. The study recommends integrating RGB-sequencing with visual modality to enhance the accuracy of two-handed sign language gestures.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"71 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31  DOI: 10.1007/s00138-024-01555-x
Anish Monsley Kirupakaran, Rabul Hussain Laskar
Complexity intensifies when gesticulations span various scales. Traditional scale-invariant object recognition methods often falter when confronted with case-sensitive characters of the English alphabet. The literature underscores a notable gap: the absence of an open-source, multi-scale, un-instructional gesture database featuring a comprehensive dictionary. In response, we have created the NITS (gesture scale) database, which encompasses isolated mid-air gesticulations of ninety-five alphanumeric characters. In this research, we present a scale-centric framework that addresses three critical aspects: (1) detection of smaller gesture objects: our framework excels at detecting smaller gesture objects, such as a red color marker; (2) removal of redundant self co-articulated strokes: we propose an effective approach to eliminate the redundant self co-articulated strokes often present in gesture trajectories; (3) a scale-variant approach for recognition: to tackle the scale vs. size ambiguity in recognition, we introduce a novel scale-variant methodology. Our experimental results reveal a substantial improvement of approximately 16% over existing state-of-the-art models for mid-air gesture recognition. These outcomes demonstrate that our proposed approach successfully emulates the perceptibility of the human visual system, even when utilizing data from monocular vision. Furthermore, our findings underscore the imperative need for comprehensive studies encompassing scale variations in gesture recognition.
{"title":"Scale-adaptive gesture computing: detection, tracking and recognition in controlled complex environments","authors":"Anish Monsley Kirupakaran, Rabul Hussain Laskar","doi":"10.1007/s00138-024-01555-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01555-x","url":null,"abstract":"<p>Complexity intensifies when gesticulations span various scales. Traditional scale-invariant object recognition methods often falter when confronted with case-sensitive characters in the English alphabet. The literature underscores a notable gap, the absence of an open-source multi-scale un-instructional gesture database featuring a comprehensive dictionary. In response, we have created the NITS (gesture scale) database, which encompasses isolated mid-air gesticulations of ninety-five alphanumeric characters. In this research, we present a scale-centric framework that addresses three critical aspects: (1) detection of smaller gesture objects: our framework excels at detecting smaller gesture objects, such as a red color marker. (2) Removal of redundant self co-articulated strokes: we propose an effective approach to eliminate redundant self co-articulated strokes often present in gesture trajectories. (3) Scale-variant approach for recognition: to tackle the scale vs. size ambiguity in recognition, we introduce a novel scale-variant methodology. Our experimental results reveal a substantial improvement of approximately 16% compared to existing state-of-the-art recognition models for mid-air gesture recognition. These outcomes demonstrate that our proposed approach successfully emulates the perceptibility found in the human visual system, even when utilizing data from monophthalmic vision. Furthermore, our findings underscore the imperative need for comprehensive studies encompassing scale variations in gesture recognition.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salient object detection (SOD) aims to mimic human visual mechanisms to identify and segment the most salient part of an image. Although related works have made great progress in SOD, they remain limited when confronted with interference from non-salient objects, finely shaped objects, and co-salient objects. To improve the effectiveness and capability of SOD, we propose a supervised contrastive learning network with multi-scale interaction and integrity learning, named SCLNet. It adopts contrastive learning (CL), multi-reception field confusion (MRFC) and context enhancement (CE) mechanisms. In this method, the input image is first split into two branches after two different data augmentations. Unlike existing models, which focus more on boundary guidance, we add a random position mask on one branch to break the continuity of objects. Through the CL module, we obtain more semantic than appearance information by learning the invariance across different data augmentations. The MRFC module is then designed to learn, layer by layer, the internal connections and common influences of features from various reception fields. Next, the obtained features are refined through the CE module to preserve the integrity and continuity of salient objects. Finally, comprehensive evaluations on five challenging benchmark datasets show that SCLNet achieves superior results. Code is available at https://github.com/YuPangpangpang/SCLNet.
{"title":"Supervised contrastive learning with multi-scale interaction and integrity learning for salient object detection","authors":"Yu Bi, Zhenxue Chen, Chengyun Liu, Tian Liang, Fei Zheng","doi":"10.1007/s00138-024-01552-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01552-0","url":null,"abstract":"<p>Salient object detection (SOD) is designed to mimic human visual mechanisms to identify and segment the most salient part of an image. Although related works have achieved great progress in SOD, they are limited when it comes to interferences of non-salient objects, finely shaped objects and co-salient objects. To improve the effectiveness and capability of SOD, we propose a supervised contrastive learning network with multi-scale interaction and integrity learning named SCLNet. It adopts contrastive learning (CL), multi-reception field confusion (MRFC) and context enhancement (CE) mechanisms. Using this method, the input image is first divided into two branches after two different data augmentations. Unlike existing models, which focus more on boundary guidance, we add a random position mask on one branch to break the continuous of objects. Through the CL module, we obtain more semantic information than appearance information by learning the invariance of different data augmentations. The MRFC module is then designed to learn the internal connections and common influences of various reception field features layer by layer. Next, the obtained features are learned through the CE module for the integrity and continuity of salient objects. Finally, comprehensive evaluations on five challenging benchmark datasets show that SCLNet achieves superior results. Code is available at https://github.com/YuPangpangpang/SCLNet.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"07 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate medical image classification poses a significant challenge in designing expert computer-aided diagnosis systems. While deep learning approaches have shown remarkable advancements over traditional techniques, addressing inter-class similarity and intra-class dissimilarity across medical imaging modalities remains challenging. This work introduces the advanced gating transformer network (MedTransNet), a deep learning model tailored for precise medical image classification. MedTransNet uses channel and multi-gate attention mechanisms, coupled with residual interconnections, to learn category-specific attention representations from diverse medical imaging modalities. Additionally, the use of gradient centralization during training helps prevent overfitting and improves generalization, which is especially important in medical imaging applications where labeled data are often limited. Evaluation on benchmark datasets, including APTOS-2019, Figshare, and SARS-CoV-2, demonstrates the effectiveness of the proposed MedTransNet across tasks such as diabetic retinopathy severity grading, multi-class brain tumor classification, and COVID-19 detection. Experimental results show MedTransNet achieving 85.68% accuracy for retinopathy grading, 98.37% (±0.44) for tumor classification, and 99.60% for COVID-19 detection, surpassing recent deep learning models. MedTransNet holds promise for significantly improving medical image classification accuracy.
{"title":"Medtransnet: advanced gating transformer network for medical image classification","authors":"Nagur Shareef Shaik, Teja Krishna Cherukuri, N Veeranjaneulu, Jyostna Devi Bodapati","doi":"10.1007/s00138-024-01542-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01542-2","url":null,"abstract":"<p>Accurate medical image classification poses a significant challenge in designing expert computer-aided diagnosis systems. While deep learning approaches have shown remarkable advancements over traditional techniques, addressing inter-class similarity and intra-class dissimilarity across medical imaging modalities remains challenging. This work introduces the advanced gating transformer network (MedTransNet), a deep learning model tailored for precise medical image classification. MedTransNet utilizes channel and multi-gate attention mechanisms, coupled with residual interconnections, to learn category-specific attention representations from diverse medical imaging modalities. Additionally, the use of gradient centralization during training helps in preventing overfitting and improving generalization, which is especially important in medical imaging applications where the availability of labeled data is often limited. Evaluation on benchmark datasets, including APTOS-2019, Figshare, and SARS-CoV-2, demonstrates effectiveness of the proposed MedTransNet across tasks such as diabetic retinopathy severity grading, multi-class brain tumor classification, and COVID-19 detection. Experimental results showcase MedTransNet achieving 85.68% accuracy for retinopathy grading, 98.37% (<span>(pm ,0.44)</span>) for tumor classification, and 99.60% for COVID-19 detection, surpassing recent deep learning models. MedTransNet holds promise for significantly improving medical image classification accuracy.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"55 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-28  DOI: 10.1007/s00138-024-01551-1
Qijun Song, Siyun Zhou, Die Chen
Few-shot learning for image classification has become a hot topic in computer vision; it aims at learning quickly from a limited number of labeled images and generalizing to new tasks. In this paper, motivated by the idea of the Fisher Score, we propose a Discriminative Local Descriptors Attention model that uses the ratio of intra-class to inter-class similarity to adaptively highlight representative local descriptors without introducing any additional parameters, whereas most existing local-descriptor-based methods rely on neural networks that inevitably involve tedious parameter tuning. Experiments on four benchmark datasets show that our method achieves higher accuracy than state-of-the-art approaches for few-shot learning. Specifically, our method is optimal on the CUB-200 dataset, and outperforms the second-best competitive algorithm by 4.12% and 0.49% under the 5-way 1-shot and 5-way 5-shot settings, respectively.
{"title":"Learning more discriminative local descriptors with parameter-free weighted attention for few-shot learning","authors":"Qijun Song, Siyun Zhou, Die Chen","doi":"10.1007/s00138-024-01551-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01551-1","url":null,"abstract":"<p>Few-shot learning for image classification comes up as a hot topic in computer vision, which aims at fast learning from a limited number of labeled images and generalize over the new tasks. In this paper, motivated by the idea of Fisher Score, we propose a Discriminative Local Descriptors Attention model that uses the ratio of intra-class and inter-class similarity to adaptively highlight the representative local descriptors without introducing any additional parameters, while most of the existing local descriptors based methods utilize the neural networks that inevitably involve the tedious parameter tuning. Experiments on four benchmark datasets show that our method achieves higher accuracy compared with the state-of-art approaches for few-shot learning. Specifically, our method is optimal on the CUB-200 dataset, and outperforms the second best competitive algorithm by 4.12<span>(%)</span> and 0.49<span>(%)</span> under the 5-way 1-shot and 5-way 5-shot settings, respectively.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"38 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}