Machine Vision and Applications最新文献_第5页

Performance analysis of various deep learning models based on Max-Min CNN for lung nodule classification on CT images 基于 Max-Min CNN 的各种深度学习模型在 CT 图像肺结节分类中的性能分析

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-06-20 DOI: 10.1007/s00138-024-01569-5

Rekka Mastouri, Nawres Khlifa, Henda Neji, Saoussen Hantous-Zannad

Lung cancer remains one of the leading causes of cancer-related deaths worldwide, underlining the urgent need for accurate and early detection and classification methods. In this paper, we present a comprehensive study that evaluates and compares different deep learning techniques for accurately distinguishing between nodule and non-nodule in 2D CT images. Our work introduced an innovative deep learning strategy called “Max-Min CNN” to improve lung nodule classification. Three models have been developed based on the Max-Min strategy: (1) a Max-Min CNN model built and trained from scratch, (2) a Bilinear Max-Min CNN composed of two Max-Min CNN streams whose outputs were bilinearly pooled by a Kronecker product, and (3) a hybrid Max-Min ViT combining a ViT model built from scratch and the proposed Max-Min CNN architecture as a backbone. To ensure an objective analysis of our findings, we evaluated each proposed model on 3186 images from the public LUNA16 database. Experimental results demonstrated the outperformance of the proposed hybrid Max-Min ViT over the Bilinear Max-Min CNN and the Max-Min CNN, with an accuracy rate of 98.03% versus 96.89% and 95.82%, respectively. This study clearly demonstrated the contribution of the Max-Min strategy in improving the effectiveness of deep learning models for pulmonary nodule classification on CT images.

肺癌仍然是全球癌症相关死亡的主要原因之一，因此迫切需要准确的早期检测和分类方法。在本文中，我们介绍了一项综合研究，该研究评估并比较了不同的深度学习技术，以准确区分二维 CT 图像中的结节和非结节。我们的研究引入了一种名为 "Max-Min CNN "的创新深度学习策略，以改进肺结节分类。基于 Max-Min 策略开发了三种模型：（1）从零开始构建和训练的 Max-Min CNN 模型；（2）由两个 Max-Min CNN 流组成的双线性 Max-Min CNN，其输出通过 Kronecker 乘积进行双线性汇集；以及（3）混合 Max-Min ViT，将从零开始构建的 ViT 模型与所提出的 Max-Min CNN 架构相结合作为骨干。为确保对研究结果进行客观分析，我们对公共 LUNA16 数据库中的 3186 幅图像进行了评估。实验结果表明，与双线性 Max-Min CNN 和 Max-Min CNN 相比，所提出的混合 Max-Min ViT 的准确率为 98.03%，而双线性 Max-Min CNN 和 Max-Min CNN 的准确率分别为 96.89% 和 95.82%。这项研究清楚地表明，Max-Min 策略有助于提高深度学习模型在 CT 图像肺结节分类中的有效性。

{"title":"Performance analysis of various deep learning models based on Max-Min CNN for lung nodule classification on CT images","authors":"Rekka Mastouri, Nawres Khlifa, Henda Neji, Saoussen Hantous-Zannad","doi":"10.1007/s00138-024-01569-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01569-5","url":null,"abstract":"Lung cancer remains one of the leading causes of cancer-related deaths worldwide, underlining the urgent need for accurate and early detection and classification methods. In this paper, we present a comprehensive study that evaluates and compares different deep learning techniques for accurately distinguishing between nodule and non-nodule in 2D CT images. Our work introduced an innovative deep learning strategy called “Max-Min CNN” to improve lung nodule classification. Three models have been developed based on the Max-Min strategy: (1) a Max-Min CNN model built and trained from scratch, (2) a Bilinear Max-Min CNN composed of two Max-Min CNN streams whose outputs were bilinearly pooled by a Kronecker product, and (3) a hybrid Max-Min ViT combining a ViT model built from scratch and the proposed Max-Min CNN architecture as a backbone. To ensure an objective analysis of our findings, we evaluated each proposed model on 3186 images from the public LUNA16 database. Experimental results demonstrated the outperformance of the proposed hybrid Max-Min ViT over the Bilinear Max-Min CNN and the Max-Min CNN, with an accuracy rate of 98.03% versus 96.89% and 95.82%, respectively. This study clearly demonstrated the contribution of the Max-Min strategy in improving the effectiveness of deep learning models for pulmonary nodule classification on CT images.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"189 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Generation of realistic synthetic cable images to train deep learning segmentation models 生成逼真的合成电缆图像以训练深度学习分割模型

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-06-20 DOI: 10.1007/s00138-024-01562-y

Pablo MalvidoFresnillo, Wael M. Mohammed, Saigopal Vasudevan, Jose A. PerezGarcia, Jose L. MartinezLastra

Semantic segmentation is one of the most important and studied problems in machine vision, which has been solved with high accuracy by many deep learning models. However, all these models present a significant drawback, they require large and diverse datasets to be trained. Gathering and annotating all these images manually would be extremely time-consuming, hence, numerous researchers have proposed approaches to facilitate or automate the process. Nevertheless, when the objects to be segmented are deformable, such as cables, the automation of this process becomes more challenging, as the dataset needs to represent their high diversity of shapes while keeping a high level of realism, and none of the existing solutions have been able to address it effectively. Therefore, this paper proposes a novel methodology to automatically generate highly realistic synthetic datasets of cables for training deep learning models in image segmentation tasks. This methodology utilizes Blender to create photo-realistic cable scenes and a Python pipeline to introduce random variations and natural deformations. To prove its performance, a dataset composed of 25000 synthetic cable images and their corresponding masks was generated and used to train six popular deep learning segmentation models. These models were then utilized to segment real cable images achieving outstanding results (over 70% IoU and 80% Dice coefficient for all the models). Both the methodology and the generated dataset are publicly available in the project’s repository.

语义分割是机器视觉领域最重要、研究最深入的问题之一，许多深度学习模型已经高精度地解决了这一问题。然而，所有这些模型都有一个显著的缺点，那就是需要大量不同的数据集来进行训练。手动收集和标注所有这些图像将非常耗时，因此，许多研究人员提出了促进或自动完成这一过程的方法。然而，当要分割的对象是可变形的（如电缆）时，这一过程的自动化就变得更具挑战性，因为数据集需要在保持高度逼真性的同时表现出其形状的多样性，而现有的解决方案都无法有效解决这一问题。因此，本文提出了一种新颖的方法来自动生成高度逼真的电缆合成数据集，用于训练图像分割任务中的深度学习模型。该方法利用 Blender 创建照片般逼真的电缆场景，并利用 Python 管道引入随机变化和自然变形。为了证明其性能，我们生成了一个由 25000 张合成电缆图像及其相应掩码组成的数据集，并用它来训练六个流行的深度学习分割模型。然后，利用这些模型对真实电缆图像进行分割，取得了出色的效果（所有模型的 IoU 和 Dice 系数分别超过 70% 和 80%）。该方法和生成的数据集均可在项目资源库中公开获取。

{"title":"Generation of realistic synthetic cable images to train deep learning segmentation models","authors":"Pablo MalvidoFresnillo, Wael M. Mohammed, Saigopal Vasudevan, Jose A. PerezGarcia, Jose L. MartinezLastra","doi":"10.1007/s00138-024-01562-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01562-y","url":null,"abstract":"Semantic segmentation is one of the most important and studied problems in machine vision, which has been solved with high accuracy by many deep learning models. However, all these models present a significant drawback, they require large and diverse datasets to be trained. Gathering and annotating all these images manually would be extremely time-consuming, hence, numerous researchers have proposed approaches to facilitate or automate the process. Nevertheless, when the objects to be segmented are deformable, such as cables, the automation of this process becomes more challenging, as the dataset needs to represent their high diversity of shapes while keeping a high level of realism, and none of the existing solutions have been able to address it effectively. Therefore, this paper proposes a novel methodology to automatically generate highly realistic synthetic datasets of cables for training deep learning models in image segmentation tasks. This methodology utilizes Blender to create photo-realistic cable scenes and a Python pipeline to introduce random variations and natural deformations. To prove its performance, a dataset composed of 25000 synthetic cable images and their corresponding masks was generated and used to train six popular deep learning segmentation models. These models were then utilized to segment real cable images achieving outstanding results (over 70% IoU and 80% Dice coefficient for all the models). Both the methodology and the generated dataset are publicly available in the project’s repository.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"43 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Dual contrast discriminator with sharing attention for video anomaly detection 用于视频异常检测的共享注意力双对比度判别器

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-06-19 DOI: 10.1007/s00138-024-01566-8

Yiwenhao Zeng, Yihua Chen, Songsen Yu, Mingzhang Yang, Rongrong Chen, Fang Xu

The detection of video anomalies is a well-known issue in the realm of visual research. The volume of normal and abnormal sample data in this field is unbalanced, hence unsupervised training is generally used in research. Since the development of deep learning, the field of video anomaly has developed from reconstruction-based detection methods to prediction-based detection methods, and then to hybrid detection methods. To identify the presence of anomalies, these methods take advantage of the differences between ground-truth frames and reconstruction or prediction frames. Thus, the evaluation of the results is directly impacted by the quality of the generated frames. Built around the Dual Contrast Discriminator for Video Sequences (DCDVS) and the corresponding loss function, we present a novel hybrid detection method for further explanation. With less false positives and more accuracy, this method improves the discriminator’s guidance on the reconstruction-prediction network’s generation performance. we integrate optical flow processing and attention processes into the Auto-encoder (AE) reconstruction network. The network’s sensitivity to motion information and its ability to concentrate on important areas are improved by this integration. Additionally, DCDVS’s capacity to successfully recognize significant features gets improved by introducing the attention module implemented through parameter sharing. Aiming to reduce the risk of network overfitting, we also invented reverse augmentation, a data augmentation technique designed specifically for temporal data. Our approach achieved outstanding performance with AUC scores of 99.4, 92.9, and 77.3(%) on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively, demonstrates competitiveness with advanced methods and validates its effectiveness.

视频异常检测是视觉研究领域一个众所周知的问题。该领域的正常和异常样本数据量不平衡，因此研究中一般采用无监督训练。自深度学习发展以来，视频异常领域已从基于重构的检测方法发展到基于预测的检测方法，再发展到混合检测方法。这些方法利用地面实况帧与重构帧或预测帧之间的差异来识别异常的存在。因此，结果的评估直接受到生成帧质量的影响。围绕视频序列双对比度判别器（DCDVS）和相应的损失函数，我们提出了一种新颖的混合检测方法，以作进一步解释。这种方法的误报率更低，准确性更高，从而提高了判别器对重建预测网络生成性能的指导作用。这种整合提高了网络对运动信息的灵敏度和集中于重要区域的能力。此外，通过引入通过参数共享实现的注意力模块，DCDVS 成功识别重要特征的能力也得到了提高。为了降低网络过拟合的风险，我们还发明了反向增强技术，这是一种专为时态数据设计的数据增强技术。在 UCSD Ped2、CUHK Avenue 和 ShanghaiTech 数据集上，我们的方法取得了出色的性能，AUC 分数分别为 99.4、92.9 和 77.3（%），显示了与先进方法的竞争力，并验证了其有效性。

{"title":"Dual contrast discriminator with sharing attention for video anomaly detection","authors":"Yiwenhao Zeng, Yihua Chen, Songsen Yu, Mingzhang Yang, Rongrong Chen, Fang Xu","doi":"10.1007/s00138-024-01566-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01566-8","url":null,"abstract":"The detection of video anomalies is a well-known issue in the realm of visual research. The volume of normal and abnormal sample data in this field is unbalanced, hence unsupervised training is generally used in research. Since the development of deep learning, the field of video anomaly has developed from reconstruction-based detection methods to prediction-based detection methods, and then to hybrid detection methods. To identify the presence of anomalies, these methods take advantage of the differences between ground-truth frames and reconstruction or prediction frames. Thus, the evaluation of the results is directly impacted by the quality of the generated frames. Built around the Dual Contrast Discriminator for Video Sequences (DCDVS) and the corresponding loss function, we present a novel hybrid detection method for further explanation. With less false positives and more accuracy, this method improves the discriminator’s guidance on the reconstruction-prediction network’s generation performance. we integrate optical flow processing and attention processes into the Auto-encoder (AE) reconstruction network. The network’s sensitivity to motion information and its ability to concentrate on important areas are improved by this integration. Additionally, DCDVS’s capacity to successfully recognize significant features gets improved by introducing the attention module implemented through parameter sharing. Aiming to reduce the risk of network overfitting, we also invented reverse augmentation, a data augmentation technique designed specifically for temporal data. Our approach achieved outstanding performance with AUC scores of 99.4, 92.9, and 77.3(%) on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively, demonstrates competitiveness with advanced methods and validates its effectiveness.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"230 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-scale information fusion generative adversarial network for real-world noisy image denoising 用于真实世界噪声图像去噪的多尺度信息融合生成对抗网络

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-06-18 DOI: 10.1007/s00138-024-01563-x

Xuegang Hu, Wei Zhao

Image denoising is crucial for enhancing image quality, improving visual effects, and boosting the accuracy of image analysis and recognition. Most of the current image denoising methods perform superior on synthetic noise images, but their performance is limited on real-world noisy images since the types and distributions of real noise are often uncertain. To address this challenge, a multi-scale information fusion generative adversarial network method is proposed in this paper. Specifically, In this method, the generator is an end-to-end denoising network that consists of a novel encoder–decoder network branch and an improved residual network branch. The encoder–decoder branch extracts rich detailed and contextual information from images at different scales and utilizes a feature fusion method to aggregate multi-scale information, enhancing the feature representation performance of the network. The residual network further compensates for the compressed and lost information in the encoder stage. Additionally, to effectively aid the generator in accomplishing the denoising task, convolution kernels of various sizes are added to the discriminator to improve its image evaluation ability. Furthermore, the dual denoising loss function is presented to enhance the model’s capability in performing noise removal and image restoration. Experimental results show that the proposed method exhibits superior objective performance and visual quality than some state-of-the-art methods on three real-world datasets.

图像去噪对于提高图像质量、改善视觉效果以及提高图像分析和识别的准确性至关重要。目前大多数图像去噪方法在合成噪声图像上表现优异，但在真实世界的噪声图像上表现有限，因为真实噪声的类型和分布往往是不确定的。为了应对这一挑战，本文提出了一种多尺度信息融合生成对抗网络方法。具体来说，该方法的生成器是一个端到端去噪网络，由一个新颖的编码器-解码器网络分支和一个改进的残差网络分支组成。编码器-解码器分支从不同尺度的图像中提取丰富的细节信息和上下文信息，并利用特征融合方法汇总多尺度信息，从而提高网络的特征表示性能。残差网络可进一步补偿编码器阶段压缩和丢失的信息。此外，为了有效地帮助生成器完成去噪任务，在鉴别器中加入了不同大小的卷积核，以提高其图像评估能力。此外，还提出了双重去噪损失函数，以增强模型的去噪和图像复原能力。实验结果表明，在三个真实世界数据集上，所提出的方法在客观性能和视觉质量上都优于一些最先进的方法。

{"title":"Multi-scale information fusion generative adversarial network for real-world noisy image denoising","authors":"Xuegang Hu, Wei Zhao","doi":"10.1007/s00138-024-01563-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01563-x","url":null,"abstract":"Image denoising is crucial for enhancing image quality, improving visual effects, and boosting the accuracy of image analysis and recognition. Most of the current image denoising methods perform superior on synthetic noise images, but their performance is limited on real-world noisy images since the types and distributions of real noise are often uncertain. To address this challenge, a multi-scale information fusion generative adversarial network method is proposed in this paper. Specifically, In this method, the generator is an end-to-end denoising network that consists of a novel encoder–decoder network branch and an improved residual network branch. The encoder–decoder branch extracts rich detailed and contextual information from images at different scales and utilizes a feature fusion method to aggregate multi-scale information, enhancing the feature representation performance of the network. The residual network further compensates for the compressed and lost information in the encoder stage. Additionally, to effectively aid the generator in accomplishing the denoising task, convolution kernels of various sizes are added to the discriminator to improve its image evaluation ability. Furthermore, the dual denoising loss function is presented to enhance the model’s capability in performing noise removal and image restoration. Experimental results show that the proposed method exhibits superior objective performance and visual quality than some state-of-the-art methods on three real-world datasets.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Trusted 3D self-supervised representation learning with cross-modal settings 跨模态设置的可信 3D 自监督表征学习

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-06-02 DOI: 10.1007/s00138-024-01556-w

Xu Han, Haozhe Cheng, Pengcheng Shi, Jihua Zhu

Cross-modal setting employing 2D images and 3D point clouds in self-supervised representation learning is proven to be an effective way to enhance visual perception capabilities. However, different modalities have different data formats and representations. Directly using features extracted from cross-modal datasets may lead to information conflicting and collapsing. We refer to this problem as uncertainty in network learning. Therefore, reducing uncertainty to obtain trusted descriptions has become the key to improving network performance. Motivated by this, we propose our trusted cross-modal network in self-supervised learning (TCMSS). It can obtain trusted descriptions by a trusted combination module as well as improve network performance with a well-designed loss function. In the trusted combination module, we utilize the Dirichlet distribution and the subjective logic to parameterize the features and acquire probabilistic uncertainty at the same. Then, the Dempster-Shafer Theory (DST) is used to obtain trusted descriptions by weighting uncertainty to the parameterized results. We have also designed our trusted domain loss function, including domain loss and trusted loss. It can effectively improve the prediction accuracy of the network by applying contrastive learning between different feature descriptions. The experimental results show that our model outperforms previous results on linear classification in ScanObjectNN as well as few-shot classification in both ModelNet40 and ScanObjectNN. In addition, part segmentation also reports a superior result to previous methods in ShapeNet. Further, the ablation studies validate the potency of our method for a better point cloud understanding.

事实证明，在自我监督表征学习中采用二维图像和三维点云的跨模态设置是提高视觉感知能力的有效方法。然而，不同模态有不同的数据格式和表示方法。直接使用从跨模态数据集中提取的特征可能会导致信息冲突和坍塌。我们把这个问题称为网络学习中的不确定性。因此，减少不确定性以获得可信的描述已成为提高网络性能的关键。受此启发，我们提出了自监督学习中的可信跨模态网络（TCMSS）。它可以通过可信组合模块获得可信描述，并通过精心设计的损失函数提高网络性能。在可信组合模块中，我们利用 Dirichlet 分布和主观逻辑对特征进行参数化，同时获取概率不确定性。然后，通过对参数化结果的不确定性进行加权，利用 Dempster-Shafer 理论（DST）获得可信描述。我们还设计了可信域损失函数，包括域损失和可信损失。通过对不同的特征描述进行对比学习，它可以有效提高网络的预测精度。实验结果表明，我们的模型在 ScanObjectNN 的线性分类以及 ModelNet40 和 ScanObjectNN 的少拍分类上都优于之前的结果。此外，在 ShapeNet 中的部件分割结果也优于之前的方法。此外，消融研究也验证了我们的方法能够更好地理解点云。

{"title":"Trusted 3D self-supervised representation learning with cross-modal settings","authors":"Xu Han, Haozhe Cheng, Pengcheng Shi, Jihua Zhu","doi":"10.1007/s00138-024-01556-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01556-w","url":null,"abstract":"Cross-modal setting employing 2D images and 3D point clouds in self-supervised representation learning is proven to be an effective way to enhance visual perception capabilities. However, different modalities have different data formats and representations. Directly using features extracted from cross-modal datasets may lead to information conflicting and collapsing. We refer to this problem as uncertainty in network learning. Therefore, reducing uncertainty to obtain trusted descriptions has become the key to improving network performance. Motivated by this, we propose our trusted cross-modal network in self-supervised learning (TCMSS). It can obtain trusted descriptions by a trusted combination module as well as improve network performance with a well-designed loss function. In the trusted combination module, we utilize the Dirichlet distribution and the subjective logic to parameterize the features and acquire probabilistic uncertainty at the same. Then, the Dempster-Shafer Theory (DST) is used to obtain trusted descriptions by weighting uncertainty to the parameterized results. We have also designed our trusted domain loss function, including domain loss and trusted loss. It can effectively improve the prediction accuracy of the network by applying contrastive learning between different feature descriptions. The experimental results show that our model outperforms previous results on linear classification in ScanObjectNN as well as few-shot classification in both ModelNet40 and ScanObjectNN. In addition, part segmentation also reports a superior result to previous methods in ShapeNet. Further, the ablation studies validate the potency of our method for a better point cloud understanding.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"32 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Deep multimodal-based finger spelling recognition for Thai sign language: a new benchmark and model composition 基于深度多模态的泰语手语手指拼写识别：新基准和模型构成

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-31 DOI: 10.1007/s00138-024-01557-9

Wuttichai Vijitkunsawat, Teeradaj Racharak, Minh Le Nguyen

Video-based sign language recognition is vital for improving communication for the deaf and hard of hearing. Creating and maintaining quality of Thai sign language video datasets is challenging due to a lack of resources. Tackling this issue, we rigorously investigate a design and development of deep learning-based system for Thai Finger Spelling recognition, assessing various models with a new dataset of 90 standard letters performed by 43 diverse signers. We investigate seven deep learning models with three distinct modalities for our analysis: video-only methods (including RGB-sequencing-based CNN-LSTM and VGG-LSTM), human body joint coordinate sequences (processed by LSTM, BiLSTM, GRU, and Transformer models), and skeleton analysis (using TGCN with graph-structured skeleton representation). A thorough assessment of these models is conducted across seven circumstances, encompassing single-hand postures, single-hand motions with one, two, and three strokes, as well as two-hand postures with both static and dynamic point-on-hand interactions. The research highlights that the TGCN model is the optimal lightweight model in all scenarios. In single-hand pose cases, a combination of the Transformer and TGCN models of two modalities delivers outstanding performance, excelling in four particular conditions: single-hand poses, single-hand poses requiring one, two, and three strokes. In contrast, two-hand poses with static or dynamic point-on-hand interactions present substantial challenges, as the data from joint coordinates is inadequate due to hand obstructions, stemming from insufficient coordinate sequence data and the lack of a detailed skeletal graph structure. The study recommends integrating RGB-sequencing with visual modality to enhance the accuracy of two-handed sign language gestures.

基于视频的手语识别对于改善聋人和重听人的交流至关重要。由于缺乏资源，创建和维护高质量的泰语手语视频数据集具有挑战性。为了解决这个问题，我们对基于深度学习的泰语手指拼写识别系统的设计和开发进行了严格的研究，用 43 位不同手语者的 90 个标准字母组成的新数据集对各种模型进行了评估。我们在分析中研究了三种不同模式的七种深度学习模型：纯视频方法（包括基于 RGB 序列的 CNN-LSTM 和 VGG-LSTM）、人体关节坐标序列（由 LSTM、BiLSTM、GRU 和 Transformer 模型处理）以及骨架分析（使用具有图结构骨架表示的 TGCN）。研究人员在七种情况下对这些模型进行了全面评估，包括单手姿势、单手一、二、三笔动作以及双手姿势与静态和动态点对点交互。研究结果表明，在所有情况下，TGCN 模型都是最佳的轻量级模型。在单手姿势的情况下，两种模式的 Transformer 和 TGCN 模型的组合具有出色的性能，在以下四种特定条件下表现出色：单手姿势、需要一个、两个和三个笔画的单手姿势。相比之下，具有静态或动态点对点交互的双手姿势则面临巨大挑战，因为由于坐标序列数据不足和缺乏详细的骨骼图结构，手部障碍物导致关节坐标数据不足。研究建议将 RGB 序列与视觉模式相结合，以提高双手手语手势的准确性。

{"title":"Deep multimodal-based finger spelling recognition for Thai sign language: a new benchmark and model composition","authors":"Wuttichai Vijitkunsawat, Teeradaj Racharak, Minh Le Nguyen","doi":"10.1007/s00138-024-01557-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01557-9","url":null,"abstract":"Video-based sign language recognition is vital for improving communication for the deaf and hard of hearing. Creating and maintaining quality of Thai sign language video datasets is challenging due to a lack of resources. Tackling this issue, we rigorously investigate a design and development of deep learning-based system for Thai Finger Spelling recognition, assessing various models with a new dataset of 90 standard letters performed by 43 diverse signers. We investigate seven deep learning models with three distinct modalities for our analysis: video-only methods (including RGB-sequencing-based CNN-LSTM and VGG-LSTM), human body joint coordinate sequences (processed by LSTM, BiLSTM, GRU, and Transformer models), and skeleton analysis (using TGCN with graph-structured skeleton representation). A thorough assessment of these models is conducted across seven circumstances, encompassing single-hand postures, single-hand motions with one, two, and three strokes, as well as two-hand postures with both static and dynamic point-on-hand interactions. The research highlights that the TGCN model is the optimal lightweight model in all scenarios. In single-hand pose cases, a combination of the Transformer and TGCN models of two modalities delivers outstanding performance, excelling in four particular conditions: single-hand poses, single-hand poses requiring one, two, and three strokes. In contrast, two-hand poses with static or dynamic point-on-hand interactions present substantial challenges, as the data from joint coordinates is inadequate due to hand obstructions, stemming from insufficient coordinate sequence data and the lack of a detailed skeletal graph structure. The study recommends integrating RGB-sequencing with visual modality to enhance the accuracy of two-handed sign language gestures.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"71 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Scale-adaptive gesture computing: detection, tracking and recognition in controlled complex environments 规模自适应手势计算：受控复杂环境中的检测、跟踪和识别

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-31 DOI: 10.1007/s00138-024-01555-x

Anish Monsley Kirupakaran, Rabul Hussain Laskar

Complexity intensifies when gesticulations span various scales. Traditional scale-invariant object recognition methods often falter when confronted with case-sensitive characters in the English alphabet. The literature underscores a notable gap, the absence of an open-source multi-scale un-instructional gesture database featuring a comprehensive dictionary. In response, we have created the NITS (gesture scale) database, which encompasses isolated mid-air gesticulations of ninety-five alphanumeric characters. In this research, we present a scale-centric framework that addresses three critical aspects: (1) detection of smaller gesture objects: our framework excels at detecting smaller gesture objects, such as a red color marker. (2) Removal of redundant self co-articulated strokes: we propose an effective approach to eliminate redundant self co-articulated strokes often present in gesture trajectories. (3) Scale-variant approach for recognition: to tackle the scale vs. size ambiguity in recognition, we introduce a novel scale-variant methodology. Our experimental results reveal a substantial improvement of approximately 16% compared to existing state-of-the-art recognition models for mid-air gesture recognition. These outcomes demonstrate that our proposed approach successfully emulates the perceptibility found in the human visual system, even when utilizing data from monophthalmic vision. Furthermore, our findings underscore the imperative need for comprehensive studies encompassing scale variations in gesture recognition.

当手势跨越不同尺度时，复杂性就会增加。传统的尺度不变物体识别方法在面对英语字母表中区分大小写的字符时往往会束手无策。文献强调了一个明显的空白，即缺乏一个以综合词典为特色的开源多尺度非教学手势数据库。为此，我们创建了 NITS（手势比例）数据库，其中包含 95 个字母数字字符的孤立中空手势。在这项研究中，我们提出了一个以尺度为中心的框架，解决了三个关键问题：（1）检测较小的手势对象：我们的框架擅长检测较小的手势对象，如红色标记。(2) 消除多余的自共鸣笔画：我们提出了一种有效的方法来消除手势轨迹中经常出现的多余的自共鸣笔画。(3) 规模变量识别方法：为了解决识别中规模与大小的模糊性问题，我们引入了一种新颖的规模变量方法。实验结果表明，与现有的最先进识别模型相比，我们的中空手势识别率大幅提高了约 16%。这些结果表明，即使在利用单眼视觉数据的情况下，我们提出的方法也能成功模拟人类视觉系统的可感知性。此外，我们的研究结果还强调了对手势识别中的尺度变化进行全面研究的迫切需要。

{"title":"Scale-adaptive gesture computing: detection, tracking and recognition in controlled complex environments","authors":"Anish Monsley Kirupakaran, Rabul Hussain Laskar","doi":"10.1007/s00138-024-01555-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01555-x","url":null,"abstract":"Complexity intensifies when gesticulations span various scales. Traditional scale-invariant object recognition methods often falter when confronted with case-sensitive characters in the English alphabet. The literature underscores a notable gap, the absence of an open-source multi-scale un-instructional gesture database featuring a comprehensive dictionary. In response, we have created the NITS (gesture scale) database, which encompasses isolated mid-air gesticulations of ninety-five alphanumeric characters. In this research, we present a scale-centric framework that addresses three critical aspects: (1) detection of smaller gesture objects: our framework excels at detecting smaller gesture objects, such as a red color marker. (2) Removal of redundant self co-articulated strokes: we propose an effective approach to eliminate redundant self co-articulated strokes often present in gesture trajectories. (3) Scale-variant approach for recognition: to tackle the scale vs. size ambiguity in recognition, we introduce a novel scale-variant methodology. Our experimental results reveal a substantial improvement of approximately 16% compared to existing state-of-the-art recognition models for mid-air gesture recognition. These outcomes demonstrate that our proposed approach successfully emulates the perceptibility found in the human visual system, even when utilizing data from monophthalmic vision. Furthermore, our findings underscore the imperative need for comprehensive studies encompassing scale variations in gesture recognition.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Supervised contrastive learning with multi-scale interaction and integrity learning for salient object detection 针对突出物体检测的多尺度交互和完整性学习的有监督对比学习

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-29 DOI: 10.1007/s00138-024-01552-0

Yu Bi, Zhenxue Chen, Chengyun Liu, Tian Liang, Fei Zheng

Salient object detection (SOD) is designed to mimic human visual mechanisms to identify and segment the most salient part of an image. Although related works have achieved great progress in SOD, they are limited when it comes to interferences of non-salient objects, finely shaped objects and co-salient objects. To improve the effectiveness and capability of SOD, we propose a supervised contrastive learning network with multi-scale interaction and integrity learning named SCLNet. It adopts contrastive learning (CL), multi-reception field confusion (MRFC) and context enhancement (CE) mechanisms. Using this method, the input image is first divided into two branches after two different data augmentations. Unlike existing models, which focus more on boundary guidance, we add a random position mask on one branch to break the continuous of objects. Through the CL module, we obtain more semantic information than appearance information by learning the invariance of different data augmentations. The MRFC module is then designed to learn the internal connections and common influences of various reception field features layer by layer. Next, the obtained features are learned through the CE module for the integrity and continuity of salient objects. Finally, comprehensive evaluations on five challenging benchmark datasets show that SCLNet achieves superior results. Code is available at https://github.com/YuPangpangpang/SCLNet.

突出物体检测（SOD）旨在模仿人类的视觉机制，识别和分割图像中最突出的部分。虽然相关工作在 SOD 方面取得了很大进展，但在非突出物体、形状精细的物体和共突出物体的干扰方面，这些工作还存在局限性。为了提高 SOD 的效果和能力，我们提出了一种具有多尺度交互和完整性学习的有监督对比学习网络，并将其命名为 SCLNet。它采用了对比学习（CL）、多接收场混淆（MRFC）和上下文增强（CE）机制。利用这种方法，输入图像在经过两种不同的数据增强后首先被分为两个分支。与现有模型更注重边界引导不同，我们在一个分支上添加了一个随机位置掩码，以打破物体的连续性。通过 CL 模块，我们可以学习不同数据增强的不变性，从而获得比外观信息更多的语义信息。然后设计 MRFC 模块，逐层学习各种接收场特征的内部联系和共同影响因素。接着，通过 CE 模块学习所获得的特征，以确保突出对象的完整性和连续性。最后，在五个具有挑战性的基准数据集上进行的综合评估表明，SCLNet 取得了优异的成绩。代码见 https://github.com/YuPangpangpang/SCLNet。

{"title":"Supervised contrastive learning with multi-scale interaction and integrity learning for salient object detection","authors":"Yu Bi, Zhenxue Chen, Chengyun Liu, Tian Liang, Fei Zheng","doi":"10.1007/s00138-024-01552-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01552-0","url":null,"abstract":"Salient object detection (SOD) is designed to mimic human visual mechanisms to identify and segment the most salient part of an image. Although related works have achieved great progress in SOD, they are limited when it comes to interferences of non-salient objects, finely shaped objects and co-salient objects. To improve the effectiveness and capability of SOD, we propose a supervised contrastive learning network with multi-scale interaction and integrity learning named SCLNet. It adopts contrastive learning (CL), multi-reception field confusion (MRFC) and context enhancement (CE) mechanisms. Using this method, the input image is first divided into two branches after two different data augmentations. Unlike existing models, which focus more on boundary guidance, we add a random position mask on one branch to break the continuous of objects. Through the CL module, we obtain more semantic information than appearance information by learning the invariance of different data augmentations. The MRFC module is then designed to learn the internal connections and common influences of various reception field features layer by layer. Next, the obtained features are learned through the CE module for the integrity and continuity of salient objects. Finally, comprehensive evaluations on five challenging benchmark datasets show that SCLNet achieves superior results. Code is available at https://github.com/YuPangpangpang/SCLNet.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"07 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Medtransnet: advanced gating transformer network for medical image classification Medtransnet：用于医学图像分类的高级门控变压器网络

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-29 DOI: 10.1007/s00138-024-01542-2

Nagur Shareef Shaik, Teja Krishna Cherukuri, N Veeranjaneulu, Jyostna Devi Bodapati

Accurate medical image classification poses a significant challenge in designing expert computer-aided diagnosis systems. While deep learning approaches have shown remarkable advancements over traditional techniques, addressing inter-class similarity and intra-class dissimilarity across medical imaging modalities remains challenging. This work introduces the advanced gating transformer network (MedTransNet), a deep learning model tailored for precise medical image classification. MedTransNet utilizes channel and multi-gate attention mechanisms, coupled with residual interconnections, to learn category-specific attention representations from diverse medical imaging modalities. Additionally, the use of gradient centralization during training helps in preventing overfitting and improving generalization, which is especially important in medical imaging applications where the availability of labeled data is often limited. Evaluation on benchmark datasets, including APTOS-2019, Figshare, and SARS-CoV-2, demonstrates effectiveness of the proposed MedTransNet across tasks such as diabetic retinopathy severity grading, multi-class brain tumor classification, and COVID-19 detection. Experimental results showcase MedTransNet achieving 85.68% accuracy for retinopathy grading, 98.37% ((pm ,0.44)) for tumor classification, and 99.60% for COVID-19 detection, surpassing recent deep learning models. MedTransNet holds promise for significantly improving medical image classification accuracy.

准确的医学图像分类是设计专家计算机辅助诊断系统的一大挑战。虽然深度学习方法与传统技术相比取得了显著进步，但解决医学影像模式中的类间相似性和类内不相似性问题仍具有挑战性。这项研究引入了高级门控变压器网络（MedTransNet），这是一种专为精确医学影像分类而定制的深度学习模型。MedTransNet 利用通道和多门注意机制以及残差互连，从不同的医学成像模式中学习特定类别的注意表征。此外，在训练过程中使用梯度集中化有助于防止过拟合和提高泛化能力，这在医疗成像应用中尤为重要，因为标注数据的可用性往往有限。在 APTOS-2019、Figshare 和 SARS-CoV-2 等基准数据集上进行的评估证明了所提出的 MedTransNet 在糖尿病视网膜病变严重程度分级、多类脑肿瘤分类和 COVID-19 检测等任务中的有效性。实验结果表明，MedTransNet 的视网膜病变分级准确率达到 85.68%，肿瘤分类准确率达到 98.37%，COVID-19 检测准确率达到 99.60%，超过了最近的深度学习模型。MedTransNet有望显著提高医学图像分类的准确性。

{"title":"Medtransnet: advanced gating transformer network for medical image classification","authors":"Nagur Shareef Shaik, Teja Krishna Cherukuri, N Veeranjaneulu, Jyostna Devi Bodapati","doi":"10.1007/s00138-024-01542-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01542-2","url":null,"abstract":"Accurate medical image classification poses a significant challenge in designing expert computer-aided diagnosis systems. While deep learning approaches have shown remarkable advancements over traditional techniques, addressing inter-class similarity and intra-class dissimilarity across medical imaging modalities remains challenging. This work introduces the advanced gating transformer network (MedTransNet), a deep learning model tailored for precise medical image classification. MedTransNet utilizes channel and multi-gate attention mechanisms, coupled with residual interconnections, to learn category-specific attention representations from diverse medical imaging modalities. Additionally, the use of gradient centralization during training helps in preventing overfitting and improving generalization, which is especially important in medical imaging applications where the availability of labeled data is often limited. Evaluation on benchmark datasets, including APTOS-2019, Figshare, and SARS-CoV-2, demonstrates effectiveness of the proposed MedTransNet across tasks such as diabetic retinopathy severity grading, multi-class brain tumor classification, and COVID-19 detection. Experimental results showcase MedTransNet achieving 85.68% accuracy for retinopathy grading, 98.37% ((pm ,0.44)) for tumor classification, and 99.60% for COVID-19 detection, surpassing recent deep learning models. MedTransNet holds promise for significantly improving medical image classification accuracy.\u0000","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"55 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141191828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Learning more discriminative local descriptors with parameter-free weighted attention for few-shot learning 利用无参数加权注意力学习更具辨别力的局部描述符，实现少镜头学习

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-28 DOI: 10.1007/s00138-024-01551-1

Qijun Song, Siyun Zhou, Die Chen

Few-shot learning for image classification comes up as a hot topic in computer vision, which aims at fast learning from a limited number of labeled images and generalize over the new tasks. In this paper, motivated by the idea of Fisher Score, we propose a Discriminative Local Descriptors Attention model that uses the ratio of intra-class and inter-class similarity to adaptively highlight the representative local descriptors without introducing any additional parameters, while most of the existing local descriptors based methods utilize the neural networks that inevitably involve the tedious parameter tuning. Experiments on four benchmark datasets show that our method achieves higher accuracy compared with the state-of-art approaches for few-shot learning. Specifically, our method is optimal on the CUB-200 dataset, and outperforms the second best competitive algorithm by 4.12(%) and 0.49(%) under the 5-way 1-shot and 5-way 5-shot settings, respectively.

图像分类的快速学习是计算机视觉领域的一个热门话题，其目的是从数量有限的标注图像中快速学习，并在新任务中实现泛化。现有的基于局部描述符的方法大多使用神经网络，不可避免地会涉及繁琐的参数调整，而本文受 Fisher Score 的思想启发，提出了一种 Discriminative Local Descriptors Attention 模型，利用类内和类间相似性的比率自适应地突出具有代表性的局部描述符，而无需引入任何额外参数。在四个基准数据集上进行的实验表明，我们的方法与最先进的少量学习方法相比具有更高的准确性。具体来说，我们的方法在CUB-200数据集上是最优的，在5路1-shot和5路5-shot设置下，分别比第二好的竞争算法高出4.12（%）和0.49（%）。

引用次数: 0