Pub Date: 2024-05-28  DOI: 10.1007/s00138-024-01558-8
Limin Xia, Qiyue Xiao
Human–object interaction (HOI) detection aims to localize and infer interactions between humans and objects in an image. Recent works have proposed transformer encoder–decoder architectures for HOI detection with exceptional performance, but these models have certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for the different sub-tasks; they do not achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; and they suffer from high computational cost and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and decoder into three branches for human detection, object detection, and interaction classification. We then propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model's computational cost, position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments are conducted on two public HOI benchmarks to demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.
Title: Human–object interaction detection based on disentangled axial attention transformer (Machine Vision and Applications)
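Axial attention is the core efficiency device named in the abstract above. The sketch below is a minimal, single-head illustration (assuming a PyTorch setting) of attention applied along one spatial axis, which keeps the attention matrix at H x H instead of (H*W) x (H*W); the class name, the learned relative-position bias, and all sizes are illustrative assumptions, not the paper's implementation of position-sensitive axial attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention1D(nn.Module):
    """Self-attention applied independently along the height axis.

    Treating each image column as a separate sequence is what gives axial
    attention its reduced computational cost compared with full 2-D attention.
    """
    def __init__(self, dim, max_len=64):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.scale = dim ** -0.5
        # learned relative-position bias, one scalar per (query, key) offset
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)    # one sequence per column
        q, k, v = self.qkv(seq).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B*W, H, H) instead of (HW, HW)
        idx = torch.arange(h, device=x.device)
        attn = attn + self.rel_bias[idx[:, None] - idx[None, :] + self.max_len - 1]
        out = F.softmax(attn, dim=-1) @ v                   # (B*W, H, C)
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to (B, C, H, W)

layer = AxialAttention1D(32)
x = torch.randn(2, 32, 16, 16)
print(layer(x).shape)                                       # torch.Size([2, 32, 16, 16])
```

A full axial block would apply a second pass of the same kind along the width axis so that every position can eventually attend to every other one.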
Pub Date: 2024-05-18  DOI: 10.1007/s00138-024-01543-1
Sushmita Sarker, Prithul Sarker, Gunner Stone, Ryan Gorman, Alireza Tavakkoli, George Bebis, Javad Sattarvand
Point cloud analysis has a wide range of applications in areas such as computer vision, robotic manipulation, and autonomous driving. While deep learning has achieved remarkable success on image-based tasks, deep neural networks face many unique challenges in processing massive, unordered, irregular, and noisy 3D points. To stimulate future research, this paper analyzes recent progress in deep learning methods for point cloud processing and presents challenges and potential directions to advance the field. It serves as a comprehensive review of two major tasks in 3D point cloud processing, namely 3D shape classification and semantic segmentation.
Title: A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation (Machine Vision and Applications)
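A recurring theme in such surveys is how networks cope with the unordered nature of point sets. The sketch below uses an assumed PointNet-style encoder (a shared per-point MLP followed by symmetric max-pooling) purely to illustrate the permutation-invariance property that many of the reviewed classification methods rely on; all sizes and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # the same MLP is applied to every point independently
        self.mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, feat_dim))

    def forward(self, pts):                         # pts: (B, N, 3), unordered points
        return self.mlp(pts).max(dim=1).values      # symmetric pooling over the point axis

enc = TinyPointEncoder()
cloud = torch.randn(1, 1024, 3)
perm = cloud[:, torch.randperm(1024)]               # same cloud, different point order
print(torch.allclose(enc(cloud), enc(perm)))        # True: the order of points does not matter
```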
Pub Date: 2024-05-15  DOI: 10.1007/s00138-024-01541-3
Jan Küchler, Daniel Kröll, Sebastian Schoenen, Andreas Witte
Deep neural network models for image segmentation can be a powerful tool for automating motor claims handling processes in the insurance industry. A crucial aspect is the reliability of the model outputs under adverse conditions, such as low-quality photos taken by claimants to document damages. We explore the use of a meta-classification model to empirically assess the precision of segments predicted by a model trained for the semantic segmentation of car body parts. Different sets of features correlated with the quality of a segment are compared, and an AUROC score of 0.915 is achieved for distinguishing between high- and low-quality segments. By removing low-quality segments, the average mIoU of the segmentation output is improved by 16 percentage points and the number of wrongly predicted segments is reduced by 77%.
Title: Uncertainty estimates for semantic segmentation: providing enhanced reliability for automated motor claims handling (Machine Vision and Applications)
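The meta-classification idea above can be illustrated with a small sketch: per-segment features are used to predict whether a segment exceeds a quality threshold, AUROC measures how well high- and low-quality segments are separated, and low-scoring segments are discarded before aggregating metrics. The synthetic features, the 0.5 IoU cut-off, and the logistic-regression classifier are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# synthetic per-segment features, e.g. mean softmax score, entropy, segment size
n = 2000
feats = rng.normal(size=(n, 3))
# synthetic "true IoU" loosely correlated with the features
iou = 1 / (1 + np.exp(-(1.5 * feats[:, 0] - feats[:, 1] + 0.3 * rng.normal(size=n))))
labels = (iou > 0.5).astype(int)                     # 1 = high-quality segment

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.3, random_state=0)
meta_clf = LogisticRegression().fit(X_tr, y_tr)
scores = meta_clf.predict_proba(X_te)[:, 1]
print("AUROC:", roc_auc_score(y_te, scores))

# discard segments the meta-classifier considers low quality before
# aggregating metrics such as mIoU over the remaining segments
keep = scores > 0.5
print("kept", keep.sum(), "of", len(keep), "segments")
```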
Pub Date: 2024-05-13  DOI: 10.1007/s00138-024-01550-2
Jishi Liu, Huanyu Wang, Junnian Wang, Dalin He, Ruihan Xu, Xiongfeng Tang
The extensive deployment of camera-based IoT devices in our society is heightening the vulnerability of citizens' sensitive information and individual data privacy. In this context, thermal imaging techniques become essential for data desensitization, entailing the elimination of sensitive data to safeguard individual privacy. Thermal imaging can also play an important role in industry, where imagery often has low resolution, high noise, and indistinct object features. Moreover, existing works often process the entire video as a single entity, which results in suboptimal robustness because individual actions occurring at different times are overlooked. In this paper, we propose a lightweight algorithm for action recognition in thermal infrared videos that uses human skeletons to address this issue. Our approach includes YOLOv7-tiny for target detection, Alphapose for pose estimation, dynamic skeleton modeling, and Graph Convolutional Networks (GCN) for spatial-temporal feature extraction in action prediction. To overcome detection and pose-estimation challenges, we created the OQ35-human and OQ35-keypoint datasets for training. In addition, the proposed model enhances robustness by using visible-spectrum data for GCN training. Furthermore, we introduce a two-stream shift Graph Convolutional Network to improve action recognition accuracy. Our experimental results on the custom thermal infrared action dataset (InfAR-skeleton) demonstrate Top-1 accuracy of 88.06% and Top-5 accuracy of 98.28%. On the filtered kinetics-skeleton dataset, the algorithm achieves Top-1 accuracy of 55.26% and Top-5 accuracy of 83.98%. Thermal infrared action recognition thus protects individual privacy while meeting the requirements of action recognition.
Title: Thermal infrared action recognition with two-stream shift Graph Convolutional Network (Machine Vision and Applications)
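As an illustration of the GCN stage of the pipeline described above, the sketch below applies a spatial graph convolution over skeleton joints followed by a temporal convolution over frames. The 5-joint toy skeleton, the layer sizes, and the class count are assumptions; the paper's two-stream shift GCN is considerably more elaborate.

```python
import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    def __init__(self, in_c, out_c, adj):
        super().__init__()
        # symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(1).pow(-0.5)
        self.register_buffer("A", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.spatial = nn.Conv2d(in_c, out_c, kernel_size=1)                    # mixes channels per joint
        self.temporal = nn.Conv2d(out_c, out_c, kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (B, C, T, V) = batch, channels, frames, joints
        x = torch.einsum("bctv,vw->bctw", self.spatial(x), self.A)              # aggregate over neighbors
        return self.relu(self.temporal(x))

# toy 5-joint skeleton: head-neck, neck-hip, neck-left hand, neck-right hand (assumed layout)
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
A = torch.zeros(5, 5)
for i, j in edges:
    A[i, j] = A[j, i] = 1

model = nn.Sequential(STGraphConvBlock(3, 16, A), STGraphConvBlock(16, 32, A))
x = torch.randn(8, 3, 30, 5)                     # 8 clips, (x, y, confidence), 30 frames, 5 joints
feats = model(x)                                 # (8, 32, 30, 5)
logits = feats.mean(dim=(2, 3)) @ torch.randn(32, 10)   # global pooling + toy linear readout to 10 actions
print(feats.shape, logits.shape)
```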
Pub Date: 2024-05-09  DOI: 10.1007/s00138-024-01547-x
Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter
Automatic video description requires generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single sentences at a fixed level of detail, often overlook. This work examines video descriptions of manipulation actions, where varying levels of detail are crucial for conveying information about the hierarchical structure of actions, which is also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, which is more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method that aims to combine the benefits of the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data-availability scenarios. All three frameworks utilize LSTM stacks to control description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. Quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.
Title: Multi sentence description of complex manipulation action videos (Machine Vision and Applications)
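To make the idea of LSTM stacks producing descriptions at different granularities concrete, the sketch below uses an assumed hierarchical decoder: a sentence-level LSTM emits one topic state per sentence, and a word-level LSTM unrolls each sentence from that state. The dimensions, vocabulary size, and greedy decoding are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class HierarchicalDescriber(nn.Module):
    def __init__(self, vis_dim=256, hid=128, vocab=1000, max_words=12):
        super().__init__()
        self.sent_rnn = nn.LSTMCell(vis_dim, hid)      # one step per sentence (coarse level)
        self.word_rnn = nn.LSTMCell(hid, hid)          # one step per word (fine level)
        self.embed_out = nn.Linear(hid, vocab)
        self.max_words = max_words

    def forward(self, video_feat, n_sentences):        # video_feat: (B, vis_dim)
        b = video_feat.size(0)
        h_s = c_s = video_feat.new_zeros(b, self.sent_rnn.hidden_size)
        sentences = []
        for _ in range(n_sentences):                   # one topic vector per sentence
            h_s, c_s = self.sent_rnn(video_feat, (h_s, c_s))
            h_w, c_w = h_s, c_s
            words = []
            for _ in range(self.max_words):            # unroll the words of this sentence
                h_w, c_w = self.word_rnn(h_s, (h_w, c_w))
                words.append(self.embed_out(h_w).argmax(-1))   # greedy token choice
            sentences.append(torch.stack(words, dim=1))
        return sentences                               # list of (B, max_words) token-id tensors

model = HierarchicalDescriber()
out = model(torch.randn(2, 256), n_sentences=3)        # request a 3-sentence description
print(len(out), out[0].shape)
```

Varying `n_sentences` is what lets the same decoder produce either a single succinct sentence or a longer multi-sentence narrative.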
Pub Date: 2024-05-08  DOI: 10.1007/s00138-024-01537-z
Jingfu Shen, Yuanliang Zhang, Feiyue Liu, Chun Liu
To achieve real-time segmentation with accurate delineation of feasible areas and accurate target recognition in Unmanned Surface Cleaning Vessel (USCV) image processing, a segmentation approach leveraging visual sensors on USCVs was developed. Initial data collection was carried out with remote-controlled cleaning vessels, followed by data cleansing, image deduplication, and manual selection. This led to the creation of the WaterSeg dataset, tailored for segmentation tasks in USCV contexts. After comparing various deep learning-driven semantic segmentation techniques, a novel, efficient Multi-Cascade Semantic Segmentation Network (MCSSNet) was developed. Comprehensive tests demonstrated that, relative to the state of the art, MCSSNet achieved an average accuracy of 90.64%, a segmentation speed of 44.55 fps, and a 45% reduction in model parameters.
Title: Lightweight segmentation algorithm of feasible area and targets of unmanned surface cleaning vessels (Machine Vision and Applications)
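The throughput and model-size figures quoted above (fps and parameter count) are commonly measured along the lines of the sketch below. The tiny placeholder network stands in for MCSSNet, whose implementation is not given here, and the input resolution and iteration count are assumptions.

```python
import time
import torch
import torch.nn as nn

def benchmark(model, input_shape=(1, 3, 512, 512), n_iters=50):
    """Return (parameter count, frames per second) for a forward-only loop."""
    model.eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_shape)
    with torch.no_grad():
        model(x)                               # warm-up pass
        t0 = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        dt = time.perf_counter() - t0
    return n_params, n_iters / dt

toy_seg_net = nn.Sequential(                    # placeholder network, not MCSSNet
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),                        # 2-class logits, e.g. water vs. target
)
params, fps = benchmark(toy_seg_net)
print(f"{params / 1e6:.2f}M parameters, {fps:.1f} fps (CPU)")
```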
Pub Date: 2024-05-07  DOI: 10.1007/s00138-024-01548-w
Duy Cuong Bui, Ngan Linh Nguyen, Anh Hiep Hoang, Myungsik Yoo
Object tracking has emerged as an essential process for various applications in the field of computer vision, such as autonomous driving. Recently, object tracking technology has experienced rapid growth, particularly in its applications to self-driving vehicles. Tracking systems typically follow the detection-based tracking paradigm, whose performance is therefore affected by the detection results. Although deep learning has led to significant improvements in object detection, data association still depends on factors such as spatial location, motion, and appearance to associate new observations with existing tracks. In this study, we introduce a novel approach called Combined Appearance-Motion Tracking (CAMTrack) to enhance data association by integrating object appearances and their corresponding movements. The proposed tracking method uses an appearance-motion model built from an appearance-affinity network and an Interactive Multiple Model (IMM). We deploy the appearance model to capture the visual affinity between objects across frames and employ the motion model to incorporate motion constraints, obtaining robust position predictions under maneuvering movements. Moreover, we propose a two-phase association algorithm that effectively recovers lost tracks from previous frames. CAMTrack was evaluated on the widely recognized object tracking benchmarks KITTI and MOT17. The results showed the superior performance of the proposed method, highlighting its potential to contribute to advances in object tracking.
Title: CAMTrack: a combined appearance-motion method for multiple-object tracking (Machine Vision and Applications)
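The kind of appearance-motion data association described above can be sketched as follows: cosine similarity of appearance embeddings is blended with a motion affinity derived from predicted positions, and the Hungarian algorithm solves the assignment. The 0.6/0.4 weighting, the distance scale, and the gating threshold are illustrative assumptions, not CAMTrack's values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, track_preds, det_embs, det_centers,
              w_app=0.6, w_mot=0.4, gate=0.3):
    # appearance affinity: cosine similarity rescaled to [0, 1]
    a = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    b = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    app = (a @ b.T + 1) / 2
    # motion affinity: predicted track centers vs. detection centers
    dists = np.linalg.norm(track_preds[:, None, :] - det_centers[None, :, :], axis=-1)
    mot = np.exp(-dists / 100.0)
    affinity = w_app * app + w_mot * mot
    rows, cols = linear_sum_assignment(-affinity)        # maximize total affinity
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] > gate]

tracks = np.random.rand(3, 16)                           # 3 existing tracks, 16-d appearance embeddings
preds = np.array([[100., 50.], [300., 80.], [50., 200.]])  # predicted centers (e.g. from a motion filter)
dets = np.random.rand(4, 16)                             # 4 new detections
centers = np.array([[102., 48.], [48., 205.], [400., 400.], [295., 85.]])
print(associate(tracks, preds, dets, centers))           # matched (track, detection) index pairs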
Pub Date: 2024-05-05  DOI: 10.1007/s00138-024-01545-z
Zhengdan Yin, Sanping Zhou, Le Wang, Tao Dai, Gang Hua, Nanning Zheng
Gaze estimation aims to predict accurate gaze direction from natural eye images, which is an extremely challenging task due to both random variations in head pose and person-specific biases. Existing works often learn features independently from binocular images and directly concatenate them for gaze estimation. In this paper, we propose a simple yet effective two-stage framework for gaze estimation, in which both residual feature learning (RFL) and hierarchical gaze calibration (HGC) networks are designed to consistently improve the performance of gaze estimation. Specifically, the RFL network extracts informative features by jointly exploring the symmetric and asymmetric factors between the left and right eyes, producing initial predictions that are as accurate as possible. In addition, the HGC network cascades a person-specific transform module to further transform the distribution of gaze points from coarse to fine, which effectively compensates for the subjective bias in the initial predictions. Extensive experiments on both the EVE and MPIIGaze datasets show that our method outperforms state-of-the-art approaches.
Title: Residual feature learning with hierarchical calibration for gaze estimation (Machine Vision and Applications)
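The idea of jointly exploiting symmetric and asymmetric binocular cues can be sketched as follows: a shared-weight encoder produces per-eye features, and their sum (symmetric, pose-like information) and difference (asymmetric, eye-specific discrepancy) feed a small regressor that outputs a 2-D gaze direction. Layer sizes and input resolution are assumptions, not the RFL network's architecture.

```python
import torch
import torch.nn as nn

class TwoEyeGaze(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # shared weights for both eyes
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, left_eye, right_eye):
        f_l, f_r = self.encoder(left_eye), self.encoder(right_eye)
        sym = f_l + f_r                          # information shared by both eyes
        asym = f_l - f_r                         # discrepancy between the eyes
        return self.head(torch.cat([sym, asym], dim=-1))   # (B, 2) gaze angles, e.g. pitch and yaw

model = TwoEyeGaze()
gaze = model(torch.randn(4, 1, 36, 60), torch.randn(4, 1, 36, 60))
print(gaze.shape)                                # torch.Size([4, 2])
```

A calibration stage in the spirit of HGC would then apply a per-person transform to these raw predictions using a few labeled samples from that person.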
Pub Date: 2024-05-03  DOI: 10.1007/s00138-024-01546-y
Jonathan Boisclair, Ali Amamou, Sousso Kelouwani, M. Zeshan Alam, Hedi Oueslati, Lotfi Zeghmi, Kodjo Agbossou
Research on autonomous vehicles has recently been at a peak. One of the most researched aspects is the performance degradation of sensors in harsh weather conditions such as rain, snow, fog, and hail. This work addresses this performance degradation by fusing multiple sensor modalities inside the neural network used for detection. The proposed fusion method removes the pre-processing fusion stage and produces detection boxes directly from multiple images, reducing the computation cost by performing detection and fusion simultaneously. Because the modalities are separated only in the initial layers, the network can easily be modified for new sensors. Intra-network fusion improves robustness to missing inputs and applies to all compatible types of inputs, while reducing the peak computing cost through a valley-fill algorithm. Our experiments demonstrate that adopting a parallel multimodal network to fuse thermal images inside the network improves object detection in difficult weather conditions such as harsh winters by up to 5% mAP while reducing dataset bias under complicated weather conditions. It does so with around 50% fewer parameters than late-fusion approaches, which duplicate the whole network instead of only the first section of the feature extractor.
Title: A tree-based approach for visible and thermal sensor fusion in winter autonomous driving (Machine Vision and Applications)
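A minimal sketch of intra-network fusion as described above: each modality has its own early branch, the feature maps are concatenated mid-network, and a shared head follows; zeroing one branch shows how a missing input can be handled gracefully. Branch depths, channel counts, and the toy head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

def branch(in_ch):
    """Modality-specific early feature extractor."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    )

class MidFusionDetector(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.rgb_branch = branch(3)              # visible spectrum
        self.thermal_branch = branch(1)          # thermal camera
        self.shared = nn.Sequential(             # layers after the fusion point
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_classes + 4),        # class scores + one box per image (toy head)
        )

    def forward(self, rgb, thermal):
        fused = torch.cat([self.rgb_branch(rgb), self.thermal_branch(thermal)], dim=1)
        return self.shared(fused)

model = MidFusionDetector()
rgb = torch.randn(2, 3, 128, 128)
thermal = torch.randn(2, 1, 128, 128)
print(model(rgb, thermal).shape)                      # torch.Size([2, 8])
print(model(rgb, torch.zeros_like(thermal)).shape)    # still runs if the thermal input is missing
```

Only the branches before the fusion point are duplicated per modality, which is where the parameter saving over late fusion comes from.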
Pub Date: 2024-04-29  DOI: 10.1007/s00138-024-01540-4
Hanqi Yin, Guisheng Yin, Yiming Sun, Liguo Zhang, Ye Tian
Semantic segmentation plays a crucial role in various computer vision tasks, such as autonomous driving in urban scenes, and related research has made significant progress. However, because most research focuses on enhancing the performance of semantic segmentation models, little attention has been given to the performance deterioration of these models in severe weather. To address this issue, we study the robustness of a multimodal semantic segmentation model in snowy environments, a representative subset of severe weather conditions. The proposed method generates realistically simulated snowy-environment images by combining unpaired image translation with adversarial snowflake generation, effectively misleading the segmentation model's predictions. These generated adversarial images are then used for model robustness learning, enabling the model to adapt to the harshest snowy environments and enhancing its robustness to artificial adversarial perturbations to some extent. The visualization results show that the proposed method can generate approximately realistic snowy-environment images and yields satisfactory visual effects for both daytime and nighttime scenes. Moreover, quantitative results on the MFNet dataset indicate that, compared with the model without enhancement, the proposed method achieves average improvements of 4.82% and 3.95% in mAcc and mIoU, respectively. These improvements enhance the adaptability of multimodal semantic segmentation models to snowy environments and contribute to road safety. Furthermore, the proposed method demonstrates excellent applicability, as it can be seamlessly integrated into various multimodal semantic segmentation models.
Title: Robust semantic segmentation method of urban scenes in snowy environment (Machine Vision and Applications)
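A minimal sketch of robustness training with weather-style corruptions, in the spirit of the approach above: a crude random snowflake overlay stands in for the learned snow generation, and a single FGSM step makes the corruption adversarial with respect to the segmentation loss. The toy model, snow density, and epsilon are illustrative assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def add_snow(img, density=0.02):
    """Overlay sparse white speckles as a crude stand-in for simulated snow."""
    mask = (torch.rand_like(img[:, :1]) < density).float()
    return torch.clamp(img * (1 - mask) + mask, 0, 1)

def adversarial_snow_step(model, img, target, eps=0.01):
    """Make the snowy image adversarial with one FGSM step on the segmentation loss."""
    snowy = add_snow(img).requires_grad_(True)
    loss = F.cross_entropy(model(snowy), target)
    loss.backward()
    with torch.no_grad():                         # push the snowy image in the worst direction
        adv = torch.clamp(snowy + eps * snowy.grad.sign(), 0, 1)
    return adv.detach()

# toy 3-class segmentation model and one robustness-training step
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
img = torch.rand(2, 3, 64, 64)
target = torch.randint(0, 3, (2, 64, 64))

adv_img = adversarial_snow_step(model, img, target)
loss = F.cross_entropy(model(img), target) + F.cross_entropy(model(adv_img), target)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```

Training on both the clean and the adversarially corrupted images is what pushes the model toward the robustness behavior reported above.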