Fourier feature network for 3D vessel reconstruction from biplane angiograms
Pub Date: 2024-08-01 | DOI: 10.1007/s00138-024-01585-5
Sean Wu, Naoki Kaneko, David S. Liebeskind, Fabien Scalzo
3D reconstruction of biplane cerebral angiograms remains a challenging, unsolved research problem due to the loss of depth information and the unknown pixelwise correlation between input images. The occlusions that arise from having only two views make it difficult to reconstruct fine vessel details while simultaneously addressing the inherently missing information. In this paper, we take an incremental step toward solving this problem by reconstructing the corresponding 2D slice of the cerebral angiogram from biplane 1D image data. We developed a coordinate-based neural network that encodes the 1D image data along with a deterministic Fourier feature mapping of a given input point, resulting in a more spatially accurate slice reconstruction. Using only one 1D row of biplane image data, our Fourier feature network reconstructed the corresponding volume slices with a peak signal-to-noise ratio (PSNR) of 26.32 ± 0.36, a structural similarity index measure (SSIM) of 61.38 ± 1.79, a mean squared error (MSE) of 0.0023 ± 0.0002, and a mean absolute error (MAE) of 0.0364 ± 0.0029. Our research has implications for future work aimed at improving backprojection-based reconstruction by first examining individual slices reconstructed from 1D information as a prerequisite.
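For intuition, here is a minimal sketch of a coordinate-based network with a deterministic Fourier feature mapping in the spirit of this abstract; the fixed-seed frequency matrix, layer sizes, row length, and the way the two 1D biplane rows are concatenated with the mapped coordinate are illustrative assumptions, not the authors' exact design.

```python
import math
import torch
import torch.nn as nn

class FourierFeatureSliceNet(nn.Module):
    """Predicts the intensity at a 2D slice coordinate from two 1D biplane rows."""
    def __init__(self, coord_dim=2, row_len=512, num_freqs=64, hidden=256):
        super().__init__()
        # Deterministic Fourier feature frequencies: fixed seed, never trained.
        g = torch.Generator().manual_seed(0)
        self.register_buffer("B", torch.randn(num_freqs, coord_dim, generator=g) * 10.0)
        in_dim = 2 * num_freqs + 2 * row_len  # sin/cos features + both biplane rows
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # reconstructed intensity at the queried point
        )

    def forward(self, coords, rows):
        # coords: (N, 2) slice coordinates; rows: (N, 2 * row_len) stacked 1D rows.
        proj = 2.0 * math.pi * coords @ self.B.t()
        feats = torch.cat([torch.sin(proj), torch.cos(proj), rows], dim=-1)
        return self.mlp(feats)
```

Querying such a network at every coordinate of a slice then yields the reconstructed 2D slice.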
{"title":"Fourier feature network for 3D vessel reconstruction from biplane angiograms","authors":"Sean Wu, Naoki Kaneko, David S. Liebeskind, Fabien Scalzo","doi":"10.1007/s00138-024-01585-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01585-5","url":null,"abstract":"<p>3D reconstruction of biplane cerebral angiograms remains a challenging, unsolved research problem due to the loss of depth information and the unknown pixelwise correlation between input images. The occlusions arising from only two views complicate the reconstruction of fine vessel details and the simultaneous addressing of inherent missing information. In this paper, we take an incremental step toward solving this problem by reconstructing the corresponding 2D slice of the cerebral angiogram using biplane 1D image data. We developed a coordinate-based neural network that encodes the 1D image data along with a deterministic Fourier feature mapping from a given input point, resulting in a slice reconstruction that is more spatially accurate. Using only one 1D row of biplane image data, our Fourier feature network reconstructed the corresponding volume slices with a peak signal-to-noise ratio (PSNR) of 26.32 ± 0.36, a structural similarity index measure (SSIM) of 61.38 ± 1.79, a mean squared error (MSE) of 0.0023 ± 0.0002, and a mean absolute error (MAE) of 0.0364 ± 0.0029. Our research has implications for future work aimed at improving backprojection-based reconstruction by first examining individual slices from 1D information as a prerequisite.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"46 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141867620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-supervised metric learning incorporating weighted triplet constraint and Riemannian manifold optimization for classification
Pub Date: 2024-07-26 | DOI: 10.1007/s00138-024-01581-9
Yizhe Xia, Hongjuan Zhang
Metric learning focuses on finding similarities between data and aims to enlarge the distance between samples with different labels. This work proposes a semi-supervised metric learning method based on the point-to-class structure of the labeled data, which is computationally less expensive than using the point-to-point structure. Specifically, the point-to-class structure is formulated as a new triplet constraint that narrows the distance between intra-class data and enlarges the distance between inter-class data simultaneously. Moreover, to measure the dissimilarity between different classes, weights are introduced into the triplet constraint, forming the weighted triplet constraint. Two kinds of regularizers, such as a spatial regularizer, are then incorporated into the model to mitigate overfitting and preserve the topological structure of the data. Furthermore, a Riemannian gradient descent algorithm is adopted to solve the proposed model, since it fully exploits the geometric structure of Riemannian manifolds, and the proposed model can be regarded as a generalization of an unconstrained optimization problem in Euclidean space to a Riemannian manifold. With this solution strategy, the variables are constrained to a specific Riemannian manifold at each step of the iterative solution process, enabling efficient and accurate model resolution. Finally, we conduct classification experiments on various data sets and compare the classification performance to state-of-the-art methods. The experimental results demonstrate that our proposed method performs better in classification, especially for hyperspectral image data.
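As an illustration of the point-to-class idea, the sketch below measures distances between each sample and class centroids under a learned linear map and weights the resulting triplet terms per class pair; the Mahalanobis-style parameterisation, margin, and weighting scheme are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def weighted_point_to_class_triplet(X, y, L, class_weights, margin=1.0):
    """Weighted point-to-class triplet term under a learned linear map L."""
    Z = X @ L.t()                                   # transformed samples
    classes = y.unique()
    # Class centroids in the transformed space define the point-to-class structure.
    centroids = torch.stack([Z[y == c].mean(dim=0) for c in classes])
    loss = Z.new_zeros(())
    for i, c_pos in enumerate(classes):
        pos = ((Z[y == c_pos] - centroids[i]) ** 2).sum(dim=1)      # to own class
        for j, c_neg in enumerate(classes):
            if i == j:
                continue
            neg = ((Z[y == c_pos] - centroids[j]) ** 2).sum(dim=1)  # to other class
            w = class_weights.get((int(c_pos), int(c_neg)), 1.0)    # inter-class weight
            loss = loss + (w * torch.relu(pos - neg + margin)).mean()
    return loss
```

In the full model such a term would be minimized jointly with the regularizers, with the transform constrained to the appropriate Riemannian manifold.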
{"title":"Semi-supervised metric learning incorporating weighted triplet constraint and Riemannian manifold optimization for classification","authors":"Yizhe Xia, Hongjuan Zhang","doi":"10.1007/s00138-024-01581-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01581-9","url":null,"abstract":"<p>Metric learning focuses on finding similarities between data and aims to enlarge the distance between the samples with different labels. This work proposes a semi-supervised metric learning method based on the point-to-class structure of the labeled data, which is computationally less expensive, especially than using point-to-point structure. Specifically, the point-to-class structure is formulated into a new triplet constraint, which could narrow the distance of inner-class data and enlarge the distance of inter-class data simultaneously. Moreover, for measuring dissimilarity between different classes, weights are introduced into the triplet constraint and forms the weighted triplet constraint. Then, two kinds of regularizers such as spatial regularizer are rationally incorporated respectively in this model to mitigate the overfitting phenomenon and preserve the topological structure of the data. Furthermore, Riemannian gradient descent algorithm is adopted to solve the proposed model, since it can fully exploit the geometric structure of Riemannian manifolds and the proposed model can be regarded as a generalization of the unconstrained optimization problem in Euclidean space on Riemannian manifold. By introducing such solution strategy, the variables are constrained to a specific Riemannian manifold in each step of the iterative solution process, thereby enabling efficient and accurate model resolution. Finally, we conduct classification experiments on various data sets and compare the classification performance to state-of-the-art methods. The experimental results demonstrate that our proposed method has better performance in classification, especially for hyperspectral image data.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"60 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141777912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly supervised collaborative localization learning method for sewer pipe defect detection
Pub Date: 2024-07-25 | DOI: 10.1007/s00138-024-01587-3
Yang Yang, Shangqin Yang, Qi Zhao, Honghui Cao, Xinjie Peng
Long-term corrosion and external disturbances can lead to defects in sewer pipes, which threaten important parts of urban infrastructure. Automatic defect detection based on closed-circuit television (CCTV) footage has gradually matured with supervised deep learning. However, sewer pipe defects vary in type and size, and relying on human inspection to detect them is time-consuming and subjective. Therefore, a few-shot, accurate, and automatic method for sewer pipe defect localization and fine-grained classification is needed. Thus, this study constructs a few-shot image-level dataset of 15 categories using the sewer dataset ML-Sewer and then presents a collaborative localization network based on weakly supervised learning to automatically classify and detect defects. Specifically, an attention refinement module (ARM) is designed to obtain classification results and high-level semantic features. Furthermore, considering the correlation between target regions and the extraction of target edge information, we designed a collaborative localization module (CLM) consisting of two branches. Then, to ensure that the network focuses on the complete target area, this study applies an image iteration module (IIM). Finally, the results of the two branches in the CLM are fused to acquire the target localization. The experimental results show that the proposed model performs favorably in detecting sewer pipe defects, reaching a classification accuracy of 69.76% and a localization accuracy of 65.32%, which is higher than the performance of other weakly supervised detection models in sewer pipe defect detection.
{"title":"Weakly supervised collaborative localization learning method for sewer pipe defect detection","authors":"Yang Yang, Shangqin Yang, Qi Zhao, Honghui Cao, Xinjie Peng","doi":"10.1007/s00138-024-01587-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01587-3","url":null,"abstract":"<p>Long-term corrosion and external disturbances can lead to defects in sewer pipes, which threaten important parts of urban infrastructure. The automatic defect detection algorithm based on closed-circuit televisions (CCTV) has gradually matured using supervised deep learning. However, there are different types and sizes of sewer pipe defects, and relying on human inspection to detect defects is time-consuming and subjective. Therefore, a few-shot, accurate and automatic method for sewer pipe defect with localization and fine-grained classification is needed. Thus, this study constructs a few-shot image-level dataset of 15 categories using the sewer dataset ML-Sewer and then presents a collaborative localization network based on weakly supervised learning to automatically classify and detect defects. Specifically, an attention refinement module (ARM) is designed to obtain classification results and high-level semantic features. Furthermore, considering the correlation between target regions and the extraction of target edge information, we designed a collaborative localization module (CLM) consisting of two branches. Then, to ensure that the network focuses on the complete target area, this study applies an image iteration module (IIM). Finally, the results of the two branches in the CLM are fused to acquire target localization. The experimental results show that the proposed model exhibits favorable performance in detecting sewer pipe defects. The proposed method exhibits prediction classification accuracy that reaches 69.76<span>(%)</span> and a positioning accuracy rate that reaches 65.32<span>(%)</span>, which is higher than the performances of other weakly supervised detection models in sewer pipe defect detection.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"10 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141777914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An insect vision-inspired neuromorphic vision systems in low-light obstacle avoidance for intelligent vehicles
Pub Date: 2024-07-25 | DOI: 10.1007/s00138-024-01582-8
Haiyang Wang, Songwei Wang, Longlong Qian
The Lobular Giant Motion Detector (LGMD) is a neuron in the insect visual system that has been extensively studied, especially in locusts. This neuron is highly sensitive to rapidly approaching objects, allowing insects to react quickly to avoid potential threats such as approaching predators or obstacles. In the realm of intelligent vehicles, conventional RGB cameras perform poorly under extreme lighting conditions or during high-speed movement. Inspired by these biological mechanisms, we have developed a novel neuromorphic dynamic vision sensor (DVS)-driven LGMD spiking neural network (SNN) model. SNNs, distinguished by their bio-inspired spiking dynamics, offer a unique advantage in processing time-varying visual data, particularly in scenarios where rapid response and energy efficiency are paramount. Our model incorporates two distinct types of Leaky Integrate-and-Fire (LIF) neuron models and synapse models, which have been instrumental in reducing network latency and enhancing the system’s reaction speed. To address the challenge of noise in event streams, we implemented denoising techniques to ensure the integrity of the input data. Ultimately, the model was integrated into an intelligent vehicle to conduct real-time obstacle-avoidance tests in response to looming objects in simulated real scenarios. The experimental results show that the model compensates for the limitations of traditional RGB cameras in detecting looming targets in the dark, and that it can detect looming targets and achieve effective obstacle avoidance in complex and diverse dark environments.
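A minimal discrete-time Leaky Integrate-and-Fire update of the kind mentioned above is sketched below; the time constant, threshold, and reset rule are generic textbook choices rather than the two specific LIF variants used in the paper, and the Poisson drive merely stands in for binned DVS events.

```python
import numpy as np

def lif_step(v, input_current, tau=10.0, v_rest=0.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    """One Euler step of a leaky integrate-and-fire neuron population.

    v: membrane potentials, input_current: synaptic drive for this step.
    Returns (new_v, spikes) where spikes is a boolean array.
    """
    dv = (-(v - v_rest) + input_current) * (dt / tau)   # leak toward rest + drive
    v = v + dv
    spikes = v >= v_thresh                              # fire when threshold is crossed
    v = np.where(spikes, v_reset, v)                    # hard reset after a spike
    return v, spikes

# Example: a small population driven by DVS-like event counts per time bin.
v = np.zeros(8)
for t in range(100):
    events = np.random.poisson(0.3, size=8)             # stand-in for event-stream input
    v, spikes = lif_step(v, events)
```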
{"title":"An insect vision-inspired neuromorphic vision systems in low-light obstacle avoidance for intelligent vehicles","authors":"Haiyang Wang, Songwei Wang, Longlong Qian","doi":"10.1007/s00138-024-01582-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01582-8","url":null,"abstract":"<p>The Lobular Giant Motion Detector (LGMD) is a neuron in the insect visual system that has been extensively studied, especially in locusts. This neuron is highly sensitive to rapidly approaching objects, allowing insects to react quickly to avoid potential threats such as approaching predators or obstacles. In the realm of intelligent vehicles, due to the lack of performance of conventional RGB cameras in extreme light conditions or at high-speed movements. Inspired by biological mechanisms, we have developed a novel neuromorphic dynamic vision sensor (DVS) driven LGMD spiking neural network (SNN) model. SNNs, distinguished by their bio-inspired spiking dynamics, offer a unique advantage in processing time-varying visual data, particularly in scenarios where rapid response and energy efficiency are paramount. Our model incorporates two distinct types of Leaky Integrate-and-Fire (LIF) neuron models and synapse models, which have been instrumental in reducing network latency and enhancing the system’s reaction speed. And addressing the challenge of noise in event streams, we have implemented denoising techniques to ensure the integrity of the input data. Integrating the proposed methods, ultimately, the model was integrated into an intelligent vehicle to conduct real-time obstacle avoidance testing in response to looming objects in simulated real scenarios. The experimental results show that the model’s ability to compensate for the limitations of traditional RGB cameras in detecting looming targets in the dark, and can detect looming targets and implement effective obstacle avoidance in complex and diverse dark environments.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"40 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141777913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
React: recognize every action everywhere all at once
Pub Date: 2024-07-20 | DOI: 10.1007/s00138-024-01561-z
Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu
In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.
{"title":"React: recognize every action everywhere all at once","authors":"Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu","doi":"10.1007/s00138-024-01561-z","DOIUrl":"https://doi.org/10.1007/s00138-024-01561-z","url":null,"abstract":"<p>In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces <b>R</b>ecognize <b>E</b>very <b>Act</b>ion Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"24 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual home staging and relighting from a single panorama under natural illumination
Pub Date: 2024-07-11 | DOI: 10.1007/s00138-024-01559-7
Guanzhou Ji, Azadeh O. Sawyer, Srinivasa G. Narasimhan
Virtual staging techniques can digitally showcase a variety of real-world scenes. However, relighting indoor scenes from a single image is challenging due to unknown scene geometry, material properties, and outdoor spatially-varying lighting. In this study, we use the High Dynamic Range (HDR) technique to capture an indoor panorama and its paired outdoor hemispherical photograph, and we develop a novel inverse rendering approach for scene relighting and editing. Our method consists of four key components: (1) panoramic furniture detection and removal, (2) automatic floor layout design, (3) global rendering with scene geometry, new furniture objects, and the real-time outdoor photograph, and (4) virtual staging with a new camera position, outdoor illumination, scene texture, and electric lighting. The results demonstrate that a single indoor panorama can be used to generate high-quality virtual scenes under new environmental conditions. Additionally, we contribute a new calibrated HDR (Cali-HDR) dataset consisting of 137 paired indoor and outdoor photographs.
{"title":"Virtual home staging and relighting from a single panorama under natural illumination","authors":"Guanzhou Ji, Azadeh O. Sawyer, Srinivasa G. Narasimhan","doi":"10.1007/s00138-024-01559-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01559-7","url":null,"abstract":"<p>Virtual staging technique can digitally showcase a variety of real-world scenes. However, relighting indoor scenes from a single image is challenging due to unknown scene geometry, material properties, and outdoor spatially-varying lighting. In this study, we use the High Dynamic Range (HDR) technique to capture an indoor panorama and its paired outdoor hemispherical photograph, and we develop a novel inverse rendering approach for scene relighting and editing. Our method consists of four key components: (1) panoramic furniture detection and removal, (2) automatic floor layout design, (3) global rendering with scene geometry, new furniture objects, and the real-time outdoor photograph, and (4) virtual staging with new camera position, outdoor illumination, scene texture, and electrical light. The results demonstrate that a single indoor panorama can be used to generate high-quality virtual scenes under new environmental conditions. Additionally, we contribute a new calibrated HDR (Cali-HDR) dataset that consists of 137 paired indoor and outdoor photographs. The animation for virtual rendered scenes is available here.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"32 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141584987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of data augmentation techniques on subjective tasks
Pub Date: 2024-07-11 | DOI: 10.1007/s00138-024-01574-8
Luis Gonzalez-Naharro, M. Julia Flores, Jesus Martínez-Gómez, Jose M. Puerta
Data augmentation is widely applied in various computer vision problems to artificially increase the size of a dataset by transforming the original data. These techniques are employed on small datasets to prevent overfitting, and also in problems where labelling is difficult. Nevertheless, data augmentation assumes that transformations preserve ground-truth labels, which does not hold for subjective problems such as aesthetic quality assessment, in which image transformations can alter the aesthetic-quality ground truth. In this work, we study how data augmentation affects subjective problems. We train a series of models, changing the probability of augmenting images and the intensity of those augmentations. We train models on AVA for quality prediction, on Photozilla for photo style prediction, and on the subjective and objective labels of CelebA. The results show that subjective tasks degrade more than objective tasks under traditional augmentation techniques, and that this worsening depends on the specific type of subjectivity.
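A sketch of the kind of controlled sweep described above, where both the probability of applying an augmentation and its intensity are varied; the specific torchvision transforms, ranges, and grid values are illustrative assumptions, not the exact policy evaluated in the paper.

```python
import torchvision.transforms as T

def make_augmentation(p, intensity):
    """Training transform whose application probability and strength are swept."""
    aug = T.Compose([
        T.RandomHorizontalFlip(p=p),
        T.RandomApply([T.ColorJitter(brightness=intensity, contrast=intensity)], p=p),
        T.RandomApply([T.RandomRotation(degrees=30 * intensity)], p=p),
    ])
    return T.Compose([aug, T.Resize((224, 224)), T.ToTensor()])

# Example sweep: one model is trained per (probability, intensity) cell.
configs = [(p, s) for p in (0.0, 0.25, 0.5, 0.75, 1.0) for s in (0.1, 0.3, 0.5)]
```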
{"title":"Evaluation of data augmentation techniques on subjective tasks","authors":"Luis Gonzalez-Naharro, M. Julia Flores, Jesus Martínez-Gómez, Jose M. Puerta","doi":"10.1007/s00138-024-01574-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01574-8","url":null,"abstract":"<p>Data augmentation is widely applied in various computer vision problems for artificially increasing the size of a dataset by transforming the original data. These techniques are employed in small datasets to prevent overfitting, and also in problems where labelling is difficult. Nevertheless, data augmentation assumes that transformations preserve groundtruth labels, something not true for subjective problems such as aesthetic quality assessment, in which image transformations can alter their aesthetic quality groundtruth. In this work, we study how data augmentation affects subjective problems. We train a series of models, changing the probability of augmenting images and the intensity of such augmentations. We train models on AVA for quality prediction, on Photozilla for photo style prediction, and on subjective and objective labels of CelebA. Results show that subjective tasks get worse results than objective tasks with traditional augmentation techniques, and this worsening depends on the specific type of subjectivity.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"59 Pt A 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141614620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continual learning approaches to hand–eye calibration in robots
Pub Date: 2024-07-10 | DOI: 10.1007/s00138-024-01572-w
Ozan Bahadir, Jan Paul Siebert, Gerardo Aragon-Camarasa
This study addresses the problem of hand–eye calibration in robotic systems by developing Continual Learning (CL)-based approaches. Traditionally, robots require explicit models to transfer knowledge from camera observations to their hands or base. However, this poses limitations, as the hand–eye calibration parameters are typically valid only for the current camera configuration. We, therefore, propose a flexible and autonomous hand–eye calibration system that can adapt to changes in camera pose over time. Three CL-based approaches are introduced: the naive CL approach, the reservoir rehearsal approach, and the hybrid approach combining reservoir sampling with new data evaluation. The naive CL approach suffers from catastrophic forgetting, while the reservoir rehearsal approach mitigates this issue by sampling uniformly from past data. The hybrid approach further enhances performance by incorporating reservoir sampling and assessing new data for novelty. Experiments conducted in simulated and real-world environments demonstrate that the CL-based approaches, except for the naive approach, achieve competitive performance compared to traditional batch learning-based methods. This suggests that treating hand–eye calibration as a time sequence problem enables the extension of the learned space without complete retraining. The adaptability of the CL-based approaches facilitates accommodating changes in camera pose, leading to an improved hand–eye calibration system.
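The rehearsal approach relies on sampling uniformly from past data, which a reservoir buffer (Vitter's Algorithm R) provides; the sketch below is a generic buffer, and the capacity and the notion of a calibration "sample" (e.g., a camera observation paired with a robot pose) are illustrative assumptions.

```python
import random

class ReservoirBuffer:
    """Keeps a uniform random subset of all samples seen so far (Algorithm R)."""
    def __init__(self, capacity=500, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.num_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randint(0, self.num_seen - 1)   # keep with prob capacity/num_seen
            if j < self.capacity:
                self.buffer[j] = sample

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

# Rehearsal step (sketch): mix new calibration pairs with replayed ones each update.
# buffer.add((camera_observation, robot_pose)); batch = new_pairs + buffer.sample(32)
```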
{"title":"Continual learning approaches to hand–eye calibration in robots","authors":"Ozan Bahadir, Jan Paul Siebert, Gerardo Aragon-Camarasa","doi":"10.1007/s00138-024-01572-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01572-w","url":null,"abstract":"<p>This study addresses the problem of hand–eye calibration in robotic systems by developing Continual Learning (CL)-based approaches. Traditionally, robots require explicit models to transfer knowledge from camera observations to their hands or base. However, this poses limitations, as the hand–eye calibration parameters are typically valid only for the current camera configuration. We, therefore, propose a flexible and autonomous hand–eye calibration system that can adapt to changes in camera pose over time. Three CL-based approaches are introduced: the naive CL approach, the reservoir rehearsal approach, and the hybrid approach combining reservoir sampling with new data evaluation. The naive CL approach suffers from catastrophic forgetting, while the reservoir rehearsal approach mitigates this issue by sampling uniformly from past data. The hybrid approach further enhances performance by incorporating reservoir sampling and assessing new data for novelty. Experiments conducted in simulated and real-world environments demonstrate that the CL-based approaches, except for the naive approach, achieve competitive performance compared to traditional batch learning-based methods. This suggests that treating hand–eye calibration as a time sequence problem enables the extension of the learned space without complete retraining. The adaptability of the CL-based approaches facilitates accommodating changes in camera pose, leading to an improved hand–eye calibration system.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"41 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141584984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MDUNet: deep-prior unrolling network with multi-parameter data integration for low-dose computed tomography reconstruction
Pub Date: 2024-07-09 | DOI: 10.1007/s00138-024-01568-6
Temitope Emmanuel Komolafe, Nizhuan Wang, Yuchi Tian, Adegbola Oyedotun Adeniji, Liang Zhou
The goal of this study is to reconstruct a high-quality computed tomography (CT) image from a low-dose acquisition using an unrolling deep learning-based reconstruction network with lower computational complexity and a more generalized model. We propose MDUNet, a multi-parameter deep-prior unrolling network that employs cascaded convolutional and deconvolutional blocks to unroll model-based iterative reconstruction within a finite number of iterations through data-driven training. Furthermore, the embedded data-consistency constraint in MDUNet ensures that the input low-dose images and the low-dose sinograms remain consistent and incorporates the physical imaging geometry. Additionally, multi-parameter training was employed to enhance the model's generalization during the training process. Experimental results on the AAPM low-dose CT datasets show that the proposed MDUNet significantly outperforms other state-of-the-art (SOTA) methods both quantitatively and qualitatively. The cascaded blocks also reduce the computational complexity with fewer training parameters and generalize well across different datasets. In addition, the proposed MDUNet is validated on 8 different organs of interest, recovering more detailed structures and generating high-quality images. The experimental results demonstrate that the proposed MDUNet yields favorable improvements over competing methods in terms of visual quality, quantitative performance, and computational efficiency. MDUNet improves image quality with reduced computational cost and good generalization, which effectively lowers the radiation dose and reduces scanning time, making it favorable for future clinical deployment.
A framework of specialized knowledge distillation for Siamese tracker on challenging attributes
Pub Date: 2024-07-09 | DOI: 10.1007/s00138-024-01578-4
Yiding Li, Atsushi Shimada, Tsubasa Minematsu, Cheng Tang
In recent years, Siamese network-based trackers have achieved significant improvements in real-time tracking. Despite their success, performance bottlenecks caused by the unavoidably complex scenarios of target-tracking tasks are becoming increasingly non-negligible. For example, occlusion and fast motion can easily cause tracking failures and are labeled in many high-quality tracking databases as challenging attributes. In addition, Siamese trackers tend to incur high memory costs, which restricts their applicability to mobile devices with tight memory budgets. To address these issues, we propose a Specialized teachers Distilled Siamese Tracker (SDST) framework to learn a student tracker that is small and fast and has enhanced performance on challenging attributes. SDST introduces two types of teachers for multi-teacher distillation: a general teacher and specialized teachers. The former imparts basic knowledge to the student; the latter transfer specialized knowledge that helps improve its performance on challenging attributes. For the student to efficiently capture critical knowledge from the two types of teachers, SDST is equipped with a carefully designed multi-teacher knowledge distillation model. Our model contains two processes: general teacher-student knowledge transfer and specialized teachers-student knowledge transfer. Extensive empirical evaluations of several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, results on Large-scale Single Object Tracking (LaSOT) show that the proposed method achieves a significant improvement of more than 2–4% on most challenging attributes. SDST also maintains high overall performance while achieving compression rates of up to 8x and frame rates of 252 FPS, obtaining outstanding accuracy on all challenging attributes.