Deep learning has been used in many computer-vision-based applications. However, deep neural networks are vulnerable to adversarial examples that have been crafted specifically to fool a system while being imperceptible to humans. In this paper, we propose a detection defense method based on heterogeneous denoising on foreground and background (HDFB). Since the image region that dominates the output classification is usually sensitive to adversarial perturbations, HDFB focuses its defense on the foreground region rather than the whole image. First, HDFB uses a class activation map to segment examples into foreground and background regions. Second, the foreground and background are encoded into square patches. Third, the encoded foreground is zoomed in and out and denoised at two scales. Subsequently, the encoded background is denoised once using bilateral filtering. After that, the denoised foreground and background patches are decoded. Finally, the decoded foreground and background are stitched together as a denoised sample for classification. If the classifications of the denoised and input images differ, the input image is detected as an adversarial example. Comparison experiments are conducted on CIFAR-10 and MiniImageNet. The average detection rate (DR) against white-box attacks on the test sets of the two datasets is 86.4%. The average DR against black-box attacks on MiniImageNet is 88.4%. The experimental results suggest that HDFB achieves high detection performance on adversarial examples and is robust against white-box and black-box adversarial attacks. However, HDFB is insecure if its defense parameters are exposed to attackers.
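A minimal sketch of the detection rule described above, assuming a `classify` callable (the protected classifier) and a binary CAM-derived foreground mask are available; the patch encoding/decoding step is omitted, the filter choice for the foreground and all parameter values are illustrative assumptions, not the published defense parameters.

```python
import numpy as np
import cv2

def heterogeneous_denoise(image, cam_mask):
    """Denoise the CAM-salient foreground at two scales and the background once.
    Simplified sketch: the paper's square-patch encoding/decoding is omitted."""
    fg_mask = cam_mask.astype(bool)
    # Foreground: denoise at the original scale and at a 2x zoom, then average
    # (the use of a bilateral filter for the foreground is an assumption here).
    denoised_1 = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)
    zoomed = cv2.resize(image, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR)
    zoomed = cv2.bilateralFilter(zoomed, d=5, sigmaColor=50, sigmaSpace=50)
    denoised_2 = cv2.resize(zoomed, (image.shape[1], image.shape[0]))
    foreground = ((denoised_1.astype(np.float32) + denoised_2.astype(np.float32)) / 2).astype(np.uint8)
    # Background: a single bilateral-filter pass, as in the abstract.
    background = cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
    # Stitch the denoised foreground and background back into one image.
    return np.where(fg_mask[..., None], foreground, background)

def detect_adversarial(image, cam_mask, classify):
    """Flag the input as adversarial if the label changes after denoising."""
    return classify(image) != classify(heterogeneous_denoise(image, cam_mask))
```

The zoom factor and filter strengths play the role of the defense parameters that, as noted above, must stay hidden from attackers.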
{"title":"An adversarial sample detection method based on heterogeneous denoising","authors":"Lifang Zhu, Chao Liu, Zhiqiang Zhang, Yifan Cheng, Biao Jie, Xintao Ding","doi":"10.1007/s00138-024-01579-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01579-3","url":null,"abstract":"<p>Deep learning has been used in many computer-vision-based applications. However, deep neural networks are vulnerable to adversarial examples that have been crafted specifically to fool a system while being imperceptible to humans. In this paper, we propose a detection defense method based on heterogeneous denoising on foreground and background (HDFB). Since an image region that dominates to the output classification is usually sensitive to adversarial perturbations, HDFB focuses defense on the foreground region rather than the whole image. First, HDFB uses class activation map to segment examples into foreground and background regions. Second, the foreground and background are encoded to square patches. Third, the encoded foreground is zoomed in and out and is denoised in two scales. Subsequently, the encoded background is denoised once using bilateral filtering. After that, the denoised foreground and background patches are decoded. Finally, the decoded foreground and background are stitched together as a denoised sample for classification. If the classifications of the denoised and input images are different, the input image is detected as an adversarial example. The comparison experiments are implemented on CIFAR-10 and MiniImageNet. The average detection rate (DR) against white-box attacks on the test sets of the two datasets is 86.4%. The average DR against black-box attacks on MiniImageNet is 88.4%. The experimental results suggest that HDFB shows high performance on adversarial examples and is robust against white-box and black-box adversarial attacks. However, HDFB is insecure if its defense parameters are exposed to attackers.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"30 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-08, DOI: 10.1007/s00138-024-01577-5
Zhicheng Li, Chao Yang, Longyu Jiang
Feature pyramid network (FPN) improves object detection performance by means of top-down multilevel feature fusion. However, current FPN-based methods do not effectively utilize interlayer features to suppress the aliasing effects that arise in the top-down fusion process. We propose an interlayer attention feature pyramid network that integrates attention gates into the FPN through interlayer enhancement to establish the correlation between context and model, thereby highlighting the salient region of each layer and suppressing aliasing effects. Moreover, to avoid feature dilution in the top-down fusion process and the inability of multilayer features to exploit one another, a simplified non-local algorithm is used in the multilayer fusion module to fuse and enhance the multiscale features. A comprehensive analysis on the MS COCO and PASCAL VOC benchmarks demonstrates that our network achieves precise object localization and outperforms current FPN-based object detection algorithms.
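A hedged PyTorch sketch of the interlayer attention idea: the gate below follows the common additive-attention form (an assumption, since the abstract does not spell out the gate), and its output would replace the plain element-wise sum of the standard FPN top-down pathway.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterlayerAttentionGate(nn.Module):
    """Attention gate applied when fusing a top-down FPN feature into a lateral one.
    Sketch only: channel sizes and the gating form are assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)  # lateral branch
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)    # top-down branch
        self.psi = nn.Conv2d(channels, 1, kernel_size=1)           # attention map

    def forward(self, lateral, top_down):
        # Upsample the coarser top-down feature to the lateral resolution.
        top_down = F.interpolate(top_down, size=lateral.shape[-2:], mode="nearest")
        attn = torch.sigmoid(self.psi(F.relu(self.theta(lateral) + self.phi(top_down))))
        # Highlight salient lateral responses before the usual FPN addition,
        # which is intended to suppress aliasing from the top-down fusion.
        return lateral * attn + top_down
```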
{"title":"IAFPN: interlayer enhancement and multilayer fusion network for object detection","authors":"Zhicheng Li, Chao Yang, Longyu Jiang","doi":"10.1007/s00138-024-01577-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01577-5","url":null,"abstract":"<p>Feature pyramid network (FPN) improves object detection performance by means of top-down multilevel feature fusion. However, the current FPN-based methods have not effectively utilized the interlayer features to suppress the aliasing effects in the feature downward fusion process. We propose an interlayer attention feature pyramid network that attempts to integrate attention gates into FPN through interlayer enhancement to establish the correlation between context and model, thereby highlighting the salient region of each layer and suppressing the aliasing effects. Moreover, in order to avoid feature dilution in the feature downward fusion process and inability of multilayer features to utilize each other, simplified non-local algorithm is used in the multilayer fusion module to fuse and enhance the multiscale features. A comprehensive analysis of MS COCO and PASCAL VOC benchmarks demonstrate that our network achieves precise object localization and also outperforms current FPN-based object detection algorithms.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"28 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-07, DOI: 10.1007/s00138-024-01580-w
Mohana Murali Dasari, Rama Krishna Gorthi
Occlusion is a frequent phenomenon that hinders the task of visual object tracking. Since occlusion can be caused by any object and take any shape, data augmentation techniques do little to help identify it or mitigate tracker loss. Some existing works deal with occlusion only in an unsupervised manner. This paper proposes a generic deep learning framework for identifying occlusion in a given frame by formulating it, for the first time, as a supervised classification task. The proposed architecture introduces an “occlusion classification” branch into supervised trackers. This branch aids effective feature learning and also provides an occlusion status for each frame. A metric is proposed to measure the performance of trackers under occlusion at the frame level. The efficacy of the proposed framework is demonstrated on two supervised tracking paradigms: one from the widely used Siamese region-proposal class of trackers and another from the emerging transformer-based trackers. The framework is tested on six diverse datasets (GOT-10k, LaSOT, OTB2015, TrackingNet, UAV123, and VOT2018) and achieves significant improvements over the corresponding baselines while performing on par with state-of-the-art trackers. The contributions of this work are generic, as any supervised tracker can easily adopt them.
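A hedged PyTorch sketch of the added branch: the 256-channel input and the two-layer head are assumptions for illustration; the branch is trained jointly with the tracker using a per-frame binary occlusion label.

```python
import torch
import torch.nn as nn

class OcclusionHead(nn.Module):
    """Per-frame 'occlusion classification' branch attached to a supervised tracker.
    Sketch with an assumed 256-channel search-region feature map as input."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # pool the search-region feature map
            nn.Flatten(),
            nn.Linear(in_channels, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),          # one occlusion logit per frame
        )

    def forward(self, feat):
        return self.net(feat)

# Joint training sketch: add the occlusion loss to the tracker's own loss, e.g.
# occ_loss = nn.BCEWithLogitsLoss()(occlusion_head(search_feat), occ_label)
```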
{"title":"GOA-net: generic occlusion aware networks for visual tracking","authors":"Mohana Murali Dasari, Rama Krishna Gorthi","doi":"10.1007/s00138-024-01580-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01580-w","url":null,"abstract":"<p><i>Occlusion</i> is a frequent phenomenon that hinders the task of visual object tracking. Since occlusion can be from any object and in any shape, data augmentation techniques will not greatly help identify or mitigate the tracker loss. Some of the existing works deal with occlusion only in an unsupervised manner. This paper proposes a generic deep learning framework for identifying occlusion in a given frame by formulating it as a supervised classification task for the first time. The proposed architecture introduces an “occlusion classification” branch into supervised trackers. This branch helps in the effective learning of features and also provides occlusion status for each frame. A metric is proposed to measure the performance of trackers under occlusion at frame level. The efficacy of the proposed framework is demonstrated on two supervised tracking paradigms: One is from the most commonly used Siamese region proposal class of trackers, and another from the emerging transformer-based trackers. This framework is tested on six diverse datasets (GOT-10k, LaSOT, OTB2015, TrackingNet, UAV123, and VOT2018), and it achieved significant improvements in performance over the corresponding baselines while performing on par with the state-of-the-art trackers. The contributions in this work are more generic, as any supervised tracker can easily adopt them.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"38 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camera calibration is an essential prerequisite for road surveillance applications, as it determines the accuracy of the three-dimensional spatial information obtained from surveillance video. The common practice for calibration is to collect correspondences between object points and their projections in the surveillance image, which usually requires operating the calibrator manually. However, complex traffic and the calibrator requirement limit the applicability of existing methods to road scenes. This paper proposes an online camera auto-calibration method for road surveillance to overcome this problem. It constructs a large-scale virtual checkerboard from the road information in the surveillance video; the structural size of the checkerboard can easily be obtained in advance because road design is standardized. The position coordinates of the checkerboard corners are used to calibrate the camera in a “coarse-to-fine” two-step procedure that efficiently recovers the intrinsic and extrinsic parameters. Experimental results on real datasets demonstrate that the proposed approach can accurately estimate camera parameters without manual involvement or additional information input. It achieves competitive results on road surveillance auto-calibration while having lower requirements and computational cost than state-of-the-art automatic methods.
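The sketch below illustrates a coarse-to-fine calibration from virtual checkerboard correspondences using OpenCV's standard routines; it is a stand-in under stated assumptions, not the paper's exact procedure.

```python
import cv2

def calibrate_from_virtual_checkerboard(object_points, image_points, image_size):
    """object_points: Nx3 float32 array of 3-D corner coordinates of the virtual
    road checkerboard (known in advance from standardized road design);
    image_points: Nx2 float32 array of their pixel positions in the frame;
    image_size: (width, height) of the surveillance frame."""
    # Coarse step: closed-form initial intrinsics from the correspondences.
    K0 = cv2.initCameraMatrix2D([object_points], [image_points], image_size)
    # Fine step: non-linear refinement of intrinsics and extrinsics.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        [object_points], [image_points], image_size, K0, None,
        flags=cv2.CALIB_USE_INTRINSIC_GUESS)
    return K, dist, rvecs[0], tvecs[0], rms
```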
{"title":"Online camera auto-calibration appliable to road surveillance","authors":"Shusen Guo, Xianwen Yu, Yuejin Sha, Yifan Ju, Mingchen Zhu, Jiafu Wang","doi":"10.1007/s00138-024-01576-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01576-6","url":null,"abstract":"<p>Camera calibration is an essential prerequisite for road surveillance applications, which determines the accuracy of obtaining three-dimensional spatial information from surveillance video. The common practice for calibration is collecting the correspondences between the object points and their projections on surveillance, which usually needs to operate the calibrator manually. However, complex traffic and calibrator requirement limit the applicability of existing methods to road scenes. This paper proposes an online camera auto-calibration method for road surveillance to overcome the above problem. It constructs a large-scale virtual checkerboard adopting the road information from surveillance video, in which the structural size of the checkerboard can be easily obtained in advance because of the standardization for road design. The position coordinates of checkerboard corners are used for calibrating camera parameters, which is designed as a “coarse-to-fine” two-step procedure to recover the camera intrinsic and extrinsic parameters efficiently. Experimental results based on real datasets demonstrate that the proposed approach can accurately estimate camera parameters without manual involvement or additional information input. It achieves competitive effects on road surveillance auto-calibration while having lower requirements and computational costs than the automatic state-of-the-art.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"25 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141547677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-04, DOI: 10.1007/s00138-024-01575-7
Everett Fall, Kai-Wei Chang, Liang-Gee Chen
This paper presents an innovative approach that leverages a tree structure to effectively manage a large ensemble of neural networks for tackling complex video prediction tasks. Our proposed method introduces a novel technique for partitioning the function domain into simpler subsets, enabling piecewise learning by the ensemble. Accessed through an accompanying tree structure with O(log(N)) time complexity, this ensemble-tree framework progressively expands as training examples become more complex. The tree construction process incorporates a specialized algorithm that utilizes localized comparison functions learned at each decision node. To evaluate the effectiveness of our method, we conducted experiments in two challenging scenarios: action-conditional video prediction in a 3D video game environment and error detection in real-world 3D printing. Our approach consistently outperformed existing methods by a significant margin across various experiments. Additionally, we introduce a new evaluation methodology for long-term video prediction tasks that demonstrates improved alignment with qualitative observations. The results highlight the efficacy and superiority of our ensemble-tree approach in addressing complex video prediction challenges.
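A minimal sketch of the tree-managed lookup: `compare` stands for the localized comparison function learned at a decision node, and each leaf holds one ensemble member (all names are hypothetical).

```python
class Node:
    """Binary decision node: a learned comparison routes an input left or right."""
    def __init__(self, compare=None, left=None, right=None, model=None):
        self.compare = compare        # callable x -> bool, learned at this node
        self.left, self.right = left, right
        self.model = model            # leaf: the ensemble member for this subset

def route(node, x):
    """Descend the tree in O(log N) comparisons to pick the responsible network."""
    while node.model is None:
        node = node.left if node.compare(x) else node.right
    return node.model

# Usage sketch: prediction = route(root, frame_features)(frame_features)
```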
{"title":"Tree-managed network ensembles for video prediction","authors":"Everett Fall, Kai-Wei Chang, Liang-Gee Chen","doi":"10.1007/s00138-024-01575-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01575-7","url":null,"abstract":"<p>This paper presents an innovative approach that leverages a tree structure to effectively manage a large ensemble of neural networks for tackling complex video prediction tasks. Our proposed method introduces a novel technique for partitioning the function domain into simpler subsets, enabling piecewise learning by the ensemble. Seamlessly accessed by an accompanying tree structure with a time complexity of O(log(N)), this ensemble-tree framework progressively expands while training examples become more complex. The tree construction process incorporates a specialized algorithm that utilizes localized comparison functions, learned at each decision node. To evaluate the effectiveness of our method, we conducted experiments in two challenging scenarios: action-conditional video prediction in a 3D video game environment and error detection in real-world 3D printing scenarios. Our approach consistently outperformed existing methods by a significant margin across various experiments. Additionally, we introduce a new evaluation methodology for long-term video prediction tasks, which demonstrates improved alignment with qualitative observations. The results highlight the efficacy and superiority of our ensemble-tree approach in addressing complex video prediction challenges.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"29 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141547676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-03, DOI: 10.1007/s00138-024-01567-7
Alexandre Englebert, Olivier Cornu, Christophe De Vleeschouwer
The demand for explainable AI continues to rise alongside advancements in deep learning technology. Existing methods for explaining convolutional neural networks often struggle to accurately pinpoint the image features justifying a network’s prediction, owing to low-resolution saliency maps (e.g., CAM), the smooth visualizations produced by perturbation-based techniques, or the numerous isolated peaky spots of gradient-based approaches. In response, our work merges information from earlier and later layers within the network to create high-resolution class activation maps that not only remain competitive with previous art on insertion-deletion faithfulness metrics but also significantly surpass it in the precision of localizing class-specific features.
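A hedged sketch of the layer-fusion idea: each layer's class activation map is upsampled to the input resolution and the maps are combined multiplicatively so that early, high-resolution layers sharpen the later, semantically strong ones; the actual Poly-CAM recursion differs in detail.

```python
import torch
import torch.nn.functional as F

def multi_layer_cam(activations, class_weights_per_layer, out_size):
    """activations: list of (C, H, W) feature maps from early to late layers;
    class_weights_per_layer: list of (C,) class weights; out_size: (H_in, W_in).
    Sketch only, not the published Poly-CAM formulation."""
    fused = None
    for act, w in zip(activations, class_weights_per_layer):
        cam = F.relu(torch.einsum("chw,c->hw", act, w))   # weighted sum over channels
        cam = cam / (cam.max() + 1e-8)                     # normalise to [0, 1]
        cam = F.interpolate(cam[None, None], size=out_size,
                            mode="bilinear", align_corners=False)[0, 0]
        fused = cam if fused is None else fused * cam      # early layers sharpen late ones
    return fused
```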
{"title":"Poly-cam: high resolution class activation map for convolutional neural networks","authors":"Alexandre Englebert, Olivier Cornu, Christophe De Vleeschouwer","doi":"10.1007/s00138-024-01567-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01567-7","url":null,"abstract":"<p>The demand for explainable AI continues to rise alongside advancements in deep learning technology. Existing methods such as convolutional neural networks often struggle to accurately pinpoint the image features justifying a network’s prediction due to low-resolution saliency maps (e.g., CAM), smooth visualizations from perturbation-based techniques, or numerous isolated peaky spots in gradient-based approaches. In response, our work seeks to merge information from earlier and later layers within the network to create high-resolution class activation maps that not only maintain a level of competitiveness with previous art in terms of insertion-deletion faithfulness metrics but also significantly surpass it regarding the precision in localizing class-specific features.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"37 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-01, DOI: 10.1007/s00138-024-01571-x
Xi Chen, Huang Wei, Wei Guo, Fan Zhang, Jiayu Du, Zhizhong Zhou
Deep learning models have been shown to be vulnerable to critical attacks under adversarial conditions. Attackers are able to generate powerful adversarial examples by searching for adversarial perturbations, without interfering with model training or directly modifying the model. This phenomenon indicates an endogenous problem in existing deep learning frameworks. Therefore, optimizing individual models for defense is often limited and can always be defeated by new attack methods. Ensemble defense has been shown to be effective in defending against adversarial attacks by combining diverse models. However, the problem of insufficient differentiation among existing models persists. Active defense in cyberspace security has successfully defended against unknown vulnerabilities by integrating subsystems with multiple different implementations to achieve a unified mission objective. Inspired by this, we propose exploring the feasibility of achieving model differentiation by changing the data features used in training individual models, as they are the core factor of functional implementation. We utilize several feature extraction methods to preprocess the data and train differentiated models based on these features. By generating adversarial perturbations to attack different models, we demonstrate that the feature representation of the data is highly resistant to adversarial perturbations. The entire ensemble is able to operate normally in an error-bearing environment.
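A minimal sketch of inference with such an ensemble, assuming each member was trained on its own differentiated feature view; majority voting is used here for illustration, since the abstract does not specify the aggregation rule, and all names are placeholders.

```python
from collections import Counter

def ensemble_predict(x, extractors, models):
    """Majority vote over models trained on differentiated feature representations.
    extractors[i] maps a raw input to the i-th feature view; models[i] is a
    classifier (callable returning a label) trained only on that view."""
    votes = [model(extract(x)) for extract, model in zip(extractors, models)]
    # The ensemble output stays correct as long as most members resist the perturbation.
    return Counter(votes).most_common(1)[0][0]
```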
{"title":"Adversarial defence by learning differentiated feature representation in deep ensemble","authors":"Xi Chen, Huang Wei, Wei Guo, Fan Zhang, Jiayu Du, Zhizhong Zhou","doi":"10.1007/s00138-024-01571-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01571-x","url":null,"abstract":"<p>Deep learning models have been shown to be vulnerable to critical attacks under adversarial conditions. Attackers are able to generate powerful adversarial examples by searching for adversarial perturbations, without interfering with model training or directly modifying the model. This phenomenon indicates an endogenous problem in existing deep learning frameworks. Therefore, optimizing individual models for defense is often limited and can always be defeated by new attack methods. Ensemble defense has been shown to be effective in defending against adversarial attacks by combining diverse models. However, the problem of insufficient differentiation among existing models persists. Active defense in cyberspace security has successfully defended against unknown vulnerabilities by integrating subsystems with multiple different implementations to achieve a unified mission objective. Inspired by this, we propose exploring the feasibility of achieving model differentiation by changing the data features used in training individual models, as they are the core factor of functional implementation. We utilize several feature extraction methods to preprocess the data and train differentiated models based on these features. By generating adversarial perturbations to attack different models, we demonstrate that the feature representation of the data is highly resistant to adversarial perturbations. The entire ensemble is able to operate normally in an error-bearing environment.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"16 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-01, DOI: 10.1007/s00138-024-01573-9
Sheikh Shah Mohammad Motiur Rahman, Michel Salomon, Sounkalo Dembélé
The scanning electron microscope (SEM) enables imaging of micro- and nano-scale objects. It is an analytical tool widely used in the materials, earth, and life sciences. However, SEM images often suffer from high noise levels, influenced by factors such as dwell time, the time the electron beam spends on each pixel during acquisition. Slower dwell times reduce noise but risk damaging the sample, while faster ones introduce uncertainty. To this end, the latest state-of-the-art denoising techniques must be explored. Experimentation is crucial to identify the most effective methods that balance noise reduction and sample preservation, ensuring high-quality SEM images with enhanced clarity and accuracy. A thorough analysis tracing the evolution of image denoising techniques was conducted, ranging from classical methods to deep learning approaches. A comprehensive taxonomy of solutions to this inverse problem was established, detailing the developmental flow of these methods. Subsequently, the latest state-of-the-art techniques were identified and reviewed based on their reproducibility and the public availability of their source code. The selected techniques were then tested and investigated on scanning electron microscope images. After in-depth analysis and benchmarking, it is clear that existing deep learning-based denoising techniques fall short of maintaining a balance between noise reduction and the preservation of information crucial to SEM images. Issues such as information removal and over-smoothing have been identified. To address these constraints, there is a critical need for SEM image denoising techniques that prioritize both noise reduction and information preservation. Additionally, combining several networks, such as a generative adversarial network with a convolutional neural network (CNN), as in BoostNet, or a vision transformer with a CNN, as in SCUNet, improves denoising performance. It is recommended to use blind techniques to denoise real noise while taking detail preservation into account and tackling excessive smoothing, particularly in the context of SEM. In the future, the use of explainable AI will facilitate debugging and the identification of these problems.
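A sketch of the kind of benchmarking loop described above, assuming paired noisy/clean SEM images are available and scoring each candidate denoiser with the standard PSNR and SSIM metrics (the metric choice is an assumption).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def benchmark(denoisers, noisy_images, clean_images):
    """Score each denoiser on paired noisy/clean uint8 SEM images.
    denoisers: dict mapping a method name to a callable image -> image."""
    scores = {}
    for name, denoise in denoisers.items():
        psnr, ssim = [], []
        for noisy, clean in zip(noisy_images, clean_images):
            restored = denoise(noisy)
            psnr.append(peak_signal_noise_ratio(clean, restored, data_range=255))
            ssim.append(structural_similarity(clean, restored, data_range=255))
        # Average scores expose the trade-off between noise reduction and detail loss.
        scores[name] = (sum(psnr) / len(psnr), sum(ssim) / len(ssim))
    return scores
```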
{"title":"Towards scanning electron microscopy image denoising: a state-of-the-art overview, benchmark, taxonomies, and future direction","authors":"Sheikh Shah Mohammad Motiur Rahman, Michel Salomon, Sounkalo Dembélé","doi":"10.1007/s00138-024-01573-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01573-9","url":null,"abstract":"<p>Scanning electron microscope (SEM) enables imaging of micro-nano scale objects. It is an analytical tool widely used in the material, earth and life sciences. However, SEM images often suffer from high noise levels, influenced by factors such as dwell time, the time during which the electron beam remains per pixel during acquisition. Slower dwell times reduce noise but risk damaging the sample, while faster ones introduce uncertainty. To this end, the latest state-of-the-art denoising techniques must be explored. Experimentation is crucial to identify the most effective methods that balance noise reduction and sample preservation, ensuring high-quality SEM images with enhanced clarity and accuracy. A thorough analysis tracing the evolution of image denoising techniques was conducted, ranging from classical methods to deep learning approaches. A comprehensive taxonomy of this reverse problem solutions was established, detailing the developmental flow of these methods. Subsequently, the latest state-of-the-art techniques were identified and reviewed based on their reproducibility and the public availability of their source code. The selected techniques were then tested and investigated using scanning electron microscope images. After in-depth analysis and benchmarking, it is clear that the existing deep learning-based denoising techniques fall short in maintaining a balance between noise reduction and preserving crucial information for SEM images. Issues like information removal and over-smoothing have been identified. To address these constraints, there is a critical need for the development of SEM image denoising techniques that prioritize both noise reduction and information preservation. Additionally, one can see that the combination of several networks, such as the generative adversarial network and the convolutional neural network (CNN), known as BoostNet, or the vision transformer and the CNN, known as SCUNet, improves denoising performance. It is recommended to use blind techniques to denoise real noise while taking into account detail preservation and tackling excessive smoothing, particularly in the context of SEM. In the future the use of explainable AI will facilitate the debugging and the identification of these problems.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"28 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-06-27, DOI: 10.1007/s00138-024-01570-y
Dimitrios Banelas, Euripides G. M. Petrakis
MotionInsights facilitates object detection and tracking from multiple video streams in real time. Leveraging the distributed stream processing capabilities of Apache Flink and Apache Kafka (as an intermediate message broker), the system models video processing as a data-flow stream processing pipeline. Each video frame is split into smaller blocks, which are dispatched to be processed in parallel by a number of Flink operators. In the first stage, each block undergoes background subtraction and component labeling. The connected components from each frame are grouped, and the eligible components are merged into objects. In the last stage of the pipeline, all objects from each frame are consolidated to produce the trajectory of each object. The Flink application is deployed on a Kubernetes cluster on the Google Cloud Platform. Experiments in a Flink cluster with 7 machines revealed that MotionInsights achieves up to 6 times speedup compared to a monolithic (non-parallel) implementation while providing accurate trajectory patterns. The highest (i.e., more than 6 times) speedup was observed with video streams of the highest resolution. Compared to existing systems that use custom or proprietary architectures, MotionInsights is independent of the underlying hardware platform and can be deployed on common CPU architectures and the cloud.
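A local Python/OpenCV stand-in for the first pipeline stage (background subtraction plus connected-component labeling on one frame block); in MotionInsights this step runs inside a parallel Flink operator per block, so the code below is illustrative only.

```python
import cv2

def process_block(block, bg_model):
    """First pipeline stage for one frame block. bg_model is the per-block
    background subtractor owned by the operator handling this block position,
    e.g. cv2.createBackgroundSubtractorMOG2()."""
    fg_mask = bg_model.apply(block)
    # Keep confident foreground pixels only (MOG2 marks shadows as 127).
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg_mask)
    # Return component statistics; downstream stages group and merge them into
    # objects and finally into per-object trajectories.
    return stats[1:], centroids[1:]   # skip label 0, the background component
```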
{"title":"Motioninsights: real-time object tracking in streaming video","authors":"Dimitrios Banelas, Euripides G. M. Petrakis","doi":"10.1007/s00138-024-01570-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01570-y","url":null,"abstract":"<p>MotionInsights facilitates object detection and tracking from multiple video streams in real-time. Leveraging the distributed stream processing capabilities of Apache Flink and Apache Kafka (as an intermediate message broker), the system models video processing as a data flow stream processing pipeline. Each video frame is split into smaller blocks, which are dispatched to be processed in parallel by a number of Flink operators. In the first stage, each block undergoes background subtraction and component labeling. The connected components from each frame are grouped, and the eligible components are merged into objects. In the last stage of the pipeline, all objects from each frame are concentrated to produce the trajectory of each object. The Flink application is deployed as a Kubernetes cluster in the Google Cloud Platform. Experimenting in a Flink cluster with 7 machines, revealed that MotionInsights achieves up to 6 times speedup compared to a monolithic (nonparallel) implementation while providing accurate trajectory patterns. The highest (i.e., more than 6 times) speed-up was observed with video streams of the highest resolution. Compared to existing systems that use custom or proprietary architectures, MotionInsights is independent of the underlying hardware platform and can be deployed on common CPU architectures and the cloud.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gesture recognition, with its multitude of real-world applications, is one of the core areas of research in the field of human-computer interaction. In this paper, we propose a novel method for isolated and continuous hand gesture recognition utilizing movement epenthesis detection and removal. For this purpose, the present work detects and removes the movement epenthesis frames from isolated and continuous hand gesture videos. We also propose a novel modality based on the temporal difference that extracts hand regions, removes gesture-irrelevant factors, and provides the temporal information contained in hand gesture videos. Using the proposed modality together with other modalities such as the RGB modality, depth modality, and segmented hand modality, features are extracted using the GoogLeNet Caffe model. Next, we derive a set of discriminative features by fusing the acquired features into a feature vector representing the sign gesture in question. We have designed and used a Bidirectional Long Short-Term Memory network (Bi-LSTM) for classification. To test the efficacy of the proposed work, we applied our method to several publicly available continuous and isolated hand gesture datasets: ChaLearn LAP IsoGD, ChaLearn LAP ConGD, IPN Hand, and NVGesture. We observe in our experiments that the proposed method performs exceptionally well with several individual modalities as well as with combinations of modalities from these datasets. The combined effect of the proposed modality and the removal of movement epenthesis frames leads to a significant improvement in gesture recognition accuracy and a considerable reduction in computational burden. Thus, the obtained results show our proposed approach to be on par with existing state-of-the-art methods.
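A hedged sketch of the temporal-difference modality: consecutive frames are differenced and thresholded so that only moving (hand) pixels of the current frame are kept, suppressing static, gesture-irrelevant background; the threshold value is an assumption for illustration.

```python
import cv2

def temporal_difference_modality(frames, thresh=25):
    """frames: list of consecutive BGR uint8 frames from a gesture video.
    Returns one motion-masked frame per consecutive pair."""
    diffs = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Per-pixel absolute difference between consecutive grayscale frames.
        d = cv2.absdiff(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                        cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY))
        _, mask = cv2.threshold(d, thresh, 255, cv2.THRESH_BINARY)
        # Keep only the moving pixels of the current frame.
        diffs.append(cv2.bitwise_and(curr, curr, mask=mask))
    return diffs
```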
{"title":"A multi-modal framework for continuous and isolated hand gesture recognition utilizing movement epenthesis detection","authors":"Navneet Nayan, Debashis Ghosh, Pyari Mohan Pradhan","doi":"10.1007/s00138-024-01565-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01565-9","url":null,"abstract":"<p>Gesture recognition, having multitudinous applications in the real world, is one of the core areas of research in the field of human-computer interaction. In this paper, we propose a novel method for isolated and continuous hand gesture recognition utilizing the movement epenthesis detection and removal. For this purpose, the present work detects and removes the movement epenthesis frames from the isolated and continuous hand gesture videos. In this paper, we have also proposed a novel modality based on the temporal difference that extracts hand regions, removes gesture irrelevant factors and provides temporal information contained in the hand gesture videos. Using the proposed modality and other modalities such as the RGB modality, depth modality and segmented hand modality, features are extracted using Googlenet Caffe Model. Next, we derive a set of discriminative features by fusing the acquired features that form a feature vector representing the sign gesture in question. We have designed and used a Bidirectional Long Short-Term Memory Network (Bi-LSTM) for classification purpose. To test the efficacy of our proposed work, we applied our method on various publicly available continuous and isolated hand gesture datasets like ChaLearn LAP IsoGD, ChaLearn LAP ConGD, IPN Hand, and NVGesture. We observe in our experiments that our proposed method performs exceptionally well with several individual modalities as well as combination of modalities of these datasets. The combined effect of the proposed modality and movement epenthesis frames removal led to significant improvement in gesture recognition accuracy and considerable reduction in computational burden. Thus the obtained results advocate our proposed approach to be at par with the existing state-of-the-art methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"12 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}