Recently, image captioning has become an intriguing task that has attracted many researchers. This paper proposes a novel keypoint-based segmentation algorithm for extracting regions of interest (ROI) and an image captioning model guided by the ROI information to generate more accurate image captions. The Difference of Gaussians (DoG) is used to identify keypoints. A novel ROI segmentation algorithm then utilizes these keypoints to extract the ROI. Features of the ROI and the text features of related images, constructed with a Bag of Words (BoW) model, are merged into a common semantic space using canonical correlation analysis (CCA) to produce the guiding information. Based on the guiding information and the features of the entire image, an LSTM generates a caption for the image. The guiding information helps the LSTM focus on semantically important regions of the image and generate the most significant keywords in the caption. Experiments on the Flickr8k dataset show that the proposed ROI segmentation algorithm accurately identifies the ROI and that the image captioning model with the guidance information outperforms state-of-the-art methods.
{"title":"A novel key point based ROI segmentation and image captioning using guidance information","authors":"Jothi Lakshmi Selvakani, Bhuvaneshwari Ranganathan, Geetha Palanisamy","doi":"10.1007/s00138-024-01597-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01597-1","url":null,"abstract":"<p>Recently, image captioning has become an intriguing task that has attracted many researchers. This paper proposes a novel keypoint-based segmentation algorithm for extracting regions of interest (ROI) and an image captioning model guided by this information to generate more accurate image captions. The Difference of Gaussian (DoG) is used to identify keypoints. A novel ROI segmentation algorithm then utilizes these keypoints to extract the ROI. Features of the ROI are extracted, and the text features of related images are merged into a common semantic space using canonical correlation analysis (CCA) to produce the guiding information. The text features are constructed using a Bag of Words (BoW) model. Based on the guiding information and the entire image features, an LSTM generates a caption for the image. The guiding information helps the LSTM focus on important semantic regions in the image to generate the most significant keywords in the image caption. Experiments on the Flickr8k dataset show that the proposed ROI segmentation algorithm accurately identifies the ROI, and the image captioning model with the guidance information outperforms state-of-the-art methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"2011 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-10  DOI: 10.1007/s00138-024-01603-6
Hirotaka Hachiya, Yuto Yoshimura
To apply robot teaching in a factory with many mirror-polished parts, it is necessary to detect specular surfaces accurately. Deep models for mirror detection have been studied by designing mirror-specific features, e.g., contextual contrast and similarity. However, mirror-polished parts such as plastic molds tend to have complex shapes and ambiguous boundaries, so existing mirror-specific deep features do not work well. To overcome this problem, we propose attention maps based on the concepts of static specular flow (SSF), i.e., condensed reflections of the surrounding scene, and specular highlight (SH), i.e., bright light spots, both of which frequently appear even on complex-shaped specular surfaces, and apply them to deep model-based multi-level features. We then adaptively integrate the approximated mirror maps generated by multi-level SSF, SH, and existing mirror detectors to detect complex specular surfaces. Experiments on our original datasets with spherical mirrors and real-world plastic molds show the effectiveness of the proposed method.
{"title":"Specular Surface Detection with Deep Static Specular Flow and Highlight","authors":"Hirotaka Hachiya, Yuto Yoshimura","doi":"10.1007/s00138-024-01603-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01603-6","url":null,"abstract":"<p>To apply robot teaching to a factory with many mirror-polished parts, it is necessary to detect the specular surface accurately. Deep models for mirror detection have been studied by designing mirror-specific features, e.g., contextual contrast and similarity. However, mirror-polished parts such as plastic molds, tend to have complex shapes and ambiguous boundaries, and thus, existing mirror-specific deep features could not work well. To overcome the problem, we propose introducing attention maps based on the concept of static specular flow (SSF), condensed reflections of the surrounding scene, and specular highlight (SH), bright light spots, frequently appearing even in complex-shaped specular surfaces and applying them to deep model-based multi-level features. Then, we adaptively integrate approximated mirror maps generated by multi-level SSF, SH, and existing mirror detectors to detect complex specular surfaces. Through experiments with our original data sets with spherical mirrors and real-world plastic molds, we show the effectiveness of the proposed method.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"61 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-09  DOI: 10.1007/s00138-024-01607-2
Amal Chaoui, Jay Paul Morgan, Adeline Paiement, Jean Aboudarham
The study and prediction of space weather entails the analysis of solar images showing structures of the Sun's atmosphere. When imaged from the ground, these images may be polluted by terrestrial clouds, which hinder the detection of solar structures. We propose a new method to remove cloud shadows, based on a U-Net architecture, and compare classical supervision with a conditional GAN. We evaluate our method on two different imaging modalities, using both real images and a new dataset of synthetic clouds. Quantitative assessments are obtained through image quality indices (RMSE, PSNR, SSIM, and FID). We demonstrate improved results relative to a traditional cloud removal technique and a sparse coding baseline, on different cloud types and textures.
{"title":"Removing cloud shadows from ground-based solar imagery","authors":"Amal Chaoui, Jay Paul Morgan, Adeline Paiement, Jean Aboudarham","doi":"10.1007/s00138-024-01607-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01607-2","url":null,"abstract":"<p>The study and prediction of space weather entails the analysis of solar images showing structures of the Sun’s atmosphere. When imaged from the Earth’s ground, images may be polluted by terrestrial clouds which hinder the detection of solar structures. We propose a new method to remove cloud shadows, based on a U-Net architecture, and compare classical supervision with conditional GAN. We evaluate our method on two different imaging modalities, using both real images and a new dataset of synthetic clouds. Quantitative assessments are obtained through image quality indices (RMSE, PSNR, SSIM, and FID). We demonstrate improved results with regards to the traditional cloud removal technique and a sparse coding baseline, on different cloud types and textures.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"13 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-02  DOI: 10.1007/s00138-024-01606-3
Chao Yang, Ce Zhang, Longyu Jiang, Xinwen Zhang
Underwater object detection and classification technology is one of the most important ways for humans to explore the oceans. However, existing methods are still insufficient in terms of accuracy and speed and perform poorly on small objects such as fish. In this paper, we propose a multi-scale aggregation enhanced feature pyramid network (MAE-FPN) for object detection, comprising a multi-scale convolutional calibration module (MCCM) and a feature calibration distribution module (FCDM). First, we design the MCCM, which can adaptively extract feature information from objects at different scales. Then, we build the FCDM structure to make multi-scale information fusion more appropriate and to alleviate the problem of missing features from small objects. Finally, we construct the Fish Segmentation and Detection (FSD) dataset by fusing multiple data augmentation methods, which enriches the data resources for underwater object detection and mitigates the problem of limited training data for deep learning. Experiments on the FSD and public datasets show that the proposed MAE-FPN significantly improves the detection of underwater objects, especially small ones.
{"title":"Underwater image object detection based on multi-scale feature fusion","authors":"Chao Yang, Ce Zhang, Longyu Jiang, Xinwen Zhang","doi":"10.1007/s00138-024-01606-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01606-3","url":null,"abstract":"<p>Underwater object detection and classification technology is one of the most important ways for humans to explore the oceans. However, existing methods are still insufficient in terms of accuracy and speed, and have poor detection performance for small objects such as fish. In this paper, we propose a multi-scale aggregation enhanced (MAE-FPN) object detection method based on the feature pyramid network, including the multi-scale convolutional calibration module (MCCM) and the feature calibration distribution module (FCDM). First, we design the MCCM module, which can adaptively extract feature information from objects at different scales. Then, we built the FCDM structure to make the multi-scale information fusion more appropriate and to alleviate the problem of missing features from small objects. Finally, we construct the Fish Segmentation and Detection (FSD) dataset by fusing multiple data augmentation methods, which enriches the data resources for underwater object detection and solves the problem of limited training resources for deep learning. We conduct experiments on FSD and public datasets, and the results show that the proposed MAE-FPN network significantly improves the detection performance of underwater objects, especially small objects.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"113 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-29  DOI: 10.1007/s00138-024-01604-5
Ming Jing, Zhilong Ou, Hongxing Wang, Jiaxin Li, Ziyi Zhao
Active learning has achieved great success in image classification by selecting the most informative samples for data labeling and model training. However, its potential has been far from realised in object detection because of the unique challenge of utilizing localization information. A popular compromise is simply to apply active classification learning to the detected object candidates. Current efforts to incorporate the localization information of object detection usually follow a model-dependent fashion, which either works on specific detection frameworks or relies on additionally designed modules. In this paper, we propose model-agnostic Object Recognition Consistency in Regression (ORCR), which holistically measures the classification and localization uncertainty of each candidate produced by an object detector. The idea behind ORCR is to obtain the detection uncertainty by calculating the classification consistency through localization regression at two successive detection scales. Building on the proposed ORCR, we devise an active learning framework that can be deployed effortlessly on any object detection architecture. Experimental results on the PASCAL VOC and MS-COCO benchmarks show that our method achieves better performance while simplifying the active detection process.
{"title":"Object Recognition Consistency in Regression for Active Detection","authors":"Ming Jing, Zhilong Ou, Hongxing Wang, Jiaxin Li, Ziyi Zhao","doi":"10.1007/s00138-024-01604-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01604-5","url":null,"abstract":"<p>Active learning has achieved great success in image classification because of selecting the most informative samples for data labeling and model training. However, the potential of active learning has been far from being realised in object detection due to its unique challenge in utilizing localization information. A popular compromise is to simply take active classification learning over detected object candidates. To consider the localization information of object detection, current effort usually falls into the model-dependent fashion, which either works on specific detection frameworks or relies on additionally designed modules. In this paper, we propose model-agnostic Object Recognition Consistency in Regression (ORCR), which can holistically measure the uncertainty information of classification and localization of each detected candidate from object detection. The philosophy behind ORCR is to obtain the detection uncertainty by calculating the classification consistency through localization regression at two successive detection scales. In the light of the proposed ORCR, we devise an active learning framework that enables an effortless deployment to any object detection architecture. Experimental results on the PASCAL VOC and MS-COCO benchmarks show that our method achieves better performance while simplifying the active detection process.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"73 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-29  DOI: 10.1007/s00138-024-01601-8
Hongyi Qin, Alexander G. Belyaev
This paper presents a deep learning method for image dehazing and clarification. The main advantages of the method are its high computational speed and its use of unpaired image data for training. The method adapts the Zero-DCE approach (Li et al. in IEEE Trans Pattern Anal Mach Intell 44(8):4225–4238, 2021) to the image dehazing problem and uses high-order curves to adjust the dynamic range of images and achieve dehazing. Training the proposed dehazing neural network does not require paired hazy and clear datasets; instead, it relies on a set of loss functions that assess the quality of dehazed images to drive the training process. Experiments on a large number of real-world hazy images demonstrate that the proposed network effectively removes haze while preserving details and enhancing brightness. Furthermore, on an affordable GPU-equipped laptop, the processing speed can reach 1000 FPS for images with 2K resolution, making the method highly suitable for real-time dehazing applications.
{"title":"Fast no-reference deep image dehazing","authors":"Hongyi Qin, Alexander G. Belyaev","doi":"10.1007/s00138-024-01601-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01601-8","url":null,"abstract":"<p>This paper presents a deep learning method for image dehazing and clarification. The main advantages of the method are high computational speed and using unpaired image data for training. The method adapts the Zero-DCE approach (Li et al. in IEEE Trans Pattern Anal Mach Intell 44(8):4225–4238, 2021) for the image dehazing problem and uses high-order curves to adjust the dynamic range of images and achieve dehazing. Training the proposed dehazing neural network does not require paired hazy and clear datasets but instead utilizes a set of loss functions, assessing the quality of dehazed images to drive the training process. Experiments on a large number of real-world hazy images demonstrate that our proposed network effectively removes haze while preserving details and enhancing brightness. Furthermore, on an affordable GPU-equipped laptop, the processing speed can reach 1000 FPS for images with 2K resolution, making it highly suitable for real-time dehazing applications.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"18 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open set recognition (OSR) aims to accept and classify known classes while rejecting unknown classes, which is a key technology for deploying pattern recognition algorithms widely in practice. The challenge in OSR is to reduce both the empirical classification risk of known classes and the open space risk of potential unknown classes. However, existing OSR methods rarely optimize the open space risk, and much dark information in the unknown space is not taken into account, so many unknown classes are misidentified as known classes. Therefore, we present a self-supervised learning-based OSR method with synergetic proto-pull and reciprocal points, which markedly reduces both the empirical classification risk and the open space risk. In particular, we propose a new concept, the proto-pull point, which can be synergistically combined with reciprocal points to shrink the feature spaces of known and unknown classes and increase the feature distance between different classes, so as to form a good feature distribution. In addition, a self-supervised learning task of identifying the rotation of images is introduced into OSR model training, which helps the OSR model capture more discriminative features and decreases both the empirical classification and open space risks. Experimental results on benchmark datasets show that our proposed approach outperforms most existing OSR methods.
{"title":"Synergetic proto-pull and reciprocal points for open set recognition","authors":"Xin Deng, Luyao Yang, Ao Zhang, Jingwen Wang, Hexu Wang, Tianzhang Xing, Pengfei Xu","doi":"10.1007/s00138-024-01596-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01596-2","url":null,"abstract":"<p>Open set recognition (OSR) aims to accept and classify known classes while rejecting unknown classes, which is the key technology for pattern recognition algorithms to be widely applied in practice. The challenges to OSR is to reduce the empirical classification risk of known classes and the open space risk of potential unknown classes. However, the existing OSR methods less consider to optimize the open space risk, and much dark information in unknown space is not taken into account, which results in that many unknown classes are misidentified as known classes. Therefore, we present a self-supervised learningbased OSR method with synergetic proto-pull and reciprocal points, which can remarkably reduce the risks of empirical classification and open space. Especially, we propose a new concept of proto-pull point, which can be synergistically combined with reciprocal points to shrink the feature spaces of known and unknown classes, and increase the feature distance between different classes, so as to form a good feature distribution. In addition, a self-supervised learning task of identifying the directions of rotated images is introduced in OSR model training, which is benefit for the OSR mdoel to capture more distinguishing features, and decreases both empirical classification and open space risks. The final experimental results on benchmark datasets show that our propsoed approach outperforms most existing OSR methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"36 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-22  DOI: 10.1007/s00138-024-01602-7
Xiangyang Wang, Tao Pei, Rui Wang
Multi-person pose estimation and tracking are crucial research directions in the field of artificial intelligence, with widespread applications in virtual reality, action recognition, and human-computer interaction. While existing pose tracking algorithms predominantly follow the top-down paradigm, they face challenges such as pose occlusion and motion blur in complex scenes, leading to tracking inaccuracies. To address these challenges, we leverage enhanced keypoint information and pose-weighted re-identification (re-ID) features to improve multi-person pose estimation and tracking. Specifically, the proposed Decouple Heatmap Network decouples heatmaps into keypoint confidence and position, and the refined keypoint information is utilized to reconstruct occluded poses. For the pose tracking task, we introduce a more efficient pipeline founded on pose-weighted re-ID features. This pipeline integrates a Pose Embedding Network to allocate weights to re-ID features and achieves the final pose tracking through a novel tracking matching algorithm. Extensive experiments indicate that our approach performs well in both multi-person pose estimation and tracking and achieves state-of-the-art results on the PoseTrack 2017 and 2018 datasets. Our source code is available at: https://github.com/TaoTaoPei/posetracking.
{"title":"Enhanced keypoint information and pose-weighted re-ID features for multi-person pose estimation and tracking","authors":"Xiangyang Wang, Tao Pei, Rui Wang","doi":"10.1007/s00138-024-01602-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01602-7","url":null,"abstract":"<p>Multi-person pose estimation and tracking are crucial research directions in the field of artificial intelligence, with widespread applications in virtual reality, action recognition, and human-computer interaction. While existing pose tracking algorithms predominantly follow the top-down paradigm, they face challenges, such as pose occlusion and motion blur in complex scenes, leading to tracking inaccuracies. To address these challenges, we leverage enhanced keypoint information and pose-weighted re-identification (re-ID) features to improve the performance of multi-person pose estimation and tracking. Specifically, our proposed Decouple Heatmap Network decouples heatmaps into keypoint confidence and position. The refined keypoint information are utilized to reconstruct occluded poses. For the pose tracking task, we introduce a more efficient pipeline founded on pose-weighted re-ID features. This pipeline integrates a Pose Embedding Network to allocate weights to re-ID features and achieves the final pose tracking through a novel tracking matching algorithm. Extensive experiments indicate that our approach performs well in both multi-person pose estimation and tracking and achieves state-of-the-art results on the PoseTrack 2017 and 2018 datasets. Our source code is available at: https://github.com/TaoTaoPei/posetracking.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"8 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Search and rescue (SaR) is challenging because the environmental situation after a disaster is unknown. Robotics has become indispensable for precisely mapping the environment and locating victims. Combining flying and ground robots serves this purpose more effectively, thanks to their complementary viewpoints and maneuvering capabilities. To this end, a novel, cost-effective framework for mapping unknown environments is introduced that leverages You Only Look Once (YOLO) and the video streams transmitted by a ground robot and a flying robot. The integrated mapping approach performs three crucial SaR tasks: localizing the victims, i.e., determining their position in the environment and their body pose; tracking moving victims; and providing a map of the ground elevation that assists both the ground robot and the SaR crew in navigating the SaR environment. In real-life experiments at the CyberZoo of the Delft University of Technology, the framework proved very effective and precise for all these tasks, particularly in occluded and complex environments.
{"title":"Camera-based mapping in search-and-rescue via flying and ground robot teams","authors":"Bernardo Esteves Henriques, Mirko Baglioni, Anahita Jamshidnejad","doi":"10.1007/s00138-024-01594-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01594-4","url":null,"abstract":"<p>Search and rescue (SaR) is challenging, due to the unknown environmental situation after disasters occur. Robotics has become indispensable for precise mapping of the environment and for locating the victims. Combining flying and ground robots more effectively serves this purpose, due to their complementary features in terms of viewpoint and maneuvering. To this end, a novel, cost-effective framework for mapping unknown environments is introduced that leverages You Only Look Once and video streams transmitted by a ground and a flying robot. The integrated mapping approach is for performing three crucial SaR tasks: localizing the victims, i.e., determining their position in the environment and their body pose, tracking the moving victims, and providing a map of the ground elevation that assists both the ground robot and the SaR crew in navigating the SaR environment. In real-life experiments at the CyberZoo of the Delft University of Technology, the framework proved very effective and precise for all these tasks, particularly in occluded and complex environments.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"11 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-20  DOI: 10.1007/s00138-024-01599-z
Doanh C. Bui, Tam V. Nguyen, Khang Nguyen
Image captioning is an exciting yet challenging problem in both computer vision and natural language processing research. In recent years, it has been addressed by Transformer-based models optimized with cross-entropy loss, with performance further boosted via Self-Critical Sequence Training. Two types of representations are embedded into captioning models, grid features and region features, and there have been attempts to include 2D geometry information in the self-attention computation. However, the 3D order of object appearances is not considered, which confuses the model in complex scenes with overlapping objects. In addition, recent studies that use only the feature maps from the last layer or block of a pretrained CNN-based model may lack spatial information. In this paper, we present a Transformer-based captioning model dubbed TMDNet. Our model includes one module that aggregates multi-level grid features (MGFA) to enrich the representation ability using prior knowledge, and another module that effectively embeds the image's depth-grid aggregation (DGA) into the model space for better performance. The proposed model demonstrates its effectiveness via evaluation on the MS-COCO "Karpathy" test split across five standard metrics.
{"title":"Transformer with multi-level grid features and depth pooling for image captioning","authors":"Doanh C. Bui, Tam V. Nguyen, Khang Nguyen","doi":"10.1007/s00138-024-01599-z","DOIUrl":"https://doi.org/10.1007/s00138-024-01599-z","url":null,"abstract":"<p>Image captioning is an exciting yet challenging problem in both computer vision and natural language processing research. In recent years, this problem has been addressed by Transformer-based models optimized with Cross-Entropy loss and boosted performance via Self-Critical Sequence Training. Two types of representations are embedded into captioning models: grid features and region features, and there have been attempts to include 2D geometry information in the self-attention computation. However, the 3D order of object appearances is not considered, leading to confusion for the model in cases of complex scenes with overlapped objects. In addition, recent studies using only feature maps from the last layer or block of a pretrained CNN-based model may lack spatial information. In this paper, we present the Transformer-based captioning model dubbed TMDNet. Our model includes one module to aggregate multi-level grid features (MGFA) to enrich the representation ability using prior knowledge, and another module to effectively embed the image’s depth-grid aggregation (DGA) into the model space for better performance. The proposed model demonstrates its effectiveness via evaluation on the MS-COCO “Karpathy” test split across five standard metrics.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142190020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}