Machine Vision and Applications最新文献_第6页

Human–object interaction detection based on disentangled axial attention transformer 基于离散轴向注意力变换器的人-物互动检测

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-28 DOI: 10.1007/s00138-024-01558-8

Limin Xia, Qiyue Xiao

Human–object interaction (HOI) detection aims to localize and infer interactions between human and objects in an image. Recent work proposed transformer encoder–decoder architectures for HOI detection with exceptional performance, but possess certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; their transformer models suffer from high computational costs and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and decoder into three branches for human detection, object detection, and interaction classification. Then we propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model’s computational cost, a position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments are conducted on two public HOI benchmarks to demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.

人-物互动（HOI）检测旨在定位和推断图像中人与物体之间的互动。最近的研究提出了用于 HOI 检测的变压器编码器-解码器架构，其性能优异，但也存在一些缺点：它们没有采用完整的解纠缠策略来为不同的子任务学习更多的判别特征；它们无法在每个分支内实现充分的上下文交换，而这对于准确的关系推理至关重要；由于复杂的注意力计算，它们的变压器模型存在计算成本高和内存占用大的问题。在这项工作中，我们提出了一种分解变换器网络，它将编码器和解码器分解为三个分支，分别用于人类检测、物体检测和交互分类。然后，我们提出了一种新颖的特征统一解码器，用于关联每个分解解码器的预测结果，并引入了多路关系嵌入模块和殷勤融合模块，以便在各分支之间进行充分的上下文信息交换。此外，为了降低模型的计算成本，我们在编码器中加入了对位置敏感的轴向注意力，从而使我们的模型在准确性和复杂性之间实现了更好的权衡。我们在两个公开的 HOI 基准上进行了广泛的实验，以证明我们方法的有效性。结果表明，我们的模型优于其他方法，达到了最先进的性能。

{"title":"Human–object interaction detection based on disentangled axial attention transformer","authors":"Limin Xia, Qiyue Xiao","doi":"10.1007/s00138-024-01558-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01558-8","url":null,"abstract":"Human–object interaction (HOI) detection aims to localize and infer interactions between human and objects in an image. Recent work proposed transformer encoder–decoder architectures for HOI detection with exceptional performance, but possess certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; their transformer models suffer from high computational costs and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and decoder into three branches for human detection, object detection, and interaction classification. Then we propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model’s computational cost, a position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments are conducted on two public HOI benchmarks to demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"21 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation 三维点云分类和语义分割的深度学习技术综述

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-18 DOI: 10.1007/s00138-024-01543-1

Sushmita Sarker, Prithul Sarker, Gunner Stone, Ryan Gorman, Alireza Tavakkoli, George Bebis, Javad Sattarvand

Point cloud analysis has a wide range of applications in many areas such as computer vision, robotic manipulation, and autonomous driving. While deep learning has achieved remarkable success on image-based tasks, there are many unique challenges faced by deep neural networks in processing massive, unordered, irregular and noisy 3D points. To stimulate future research, this paper analyzes recent progress in deep learning methods employed for point cloud processing and presents challenges and potential directions to advance this field. It serves as a comprehensive review on two major tasks in 3D point cloud processing—namely, 3D shape classification and semantic segmentation.

点云分析在计算机视觉、机器人操纵和自动驾驶等许多领域有着广泛的应用。虽然深度学习在基于图像的任务中取得了显著成就，但深度神经网络在处理大量、无序、不规则和嘈杂的三维点时面临着许多独特的挑战。为了激励未来的研究，本文分析了点云处理所采用的深度学习方法的最新进展，并提出了推进这一领域的挑战和潜在方向。本文全面回顾了三维点云处理中的两大任务，即三维形状分类和语义分割。

引用次数: 0

Uncertainty estimates for semantic segmentation: providing enhanced reliability for automated motor claims handling 语义分割的不确定性估计：为自动汽车索赔处理提供更高可靠性

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-15 DOI: 10.1007/s00138-024-01541-3

Jan Küchler, Daniel Kröll, Sebastian Schoenen, Andreas Witte

Deep neural network models for image segmentation can be a powerful tool for the automation of motor claims handling processes in the insurance industry. A crucial aspect is the reliability of the model outputs when facing adverse conditions, such as low quality photos taken by claimants to document damages. We explore the use of a meta-classification model to empirically assess the precision of segments predicted by a model trained for the semantic segmentation of car body parts. Different sets of features correlated with the quality of a segment are compared, and an AUROC score of 0.915 is achieved for distinguishing between high- and low-quality segments. By removing low-quality segments, the average (m{textit{IoU}} ) of the segmentation output is improved by 16 percentage points and the number of wrongly predicted segments is reduced by 77%.

用于图像分割的深度神经网络模型是保险业汽车理赔处理流程自动化的有力工具。一个至关重要的方面是，在面临不利条件（例如索赔人为记录损失而拍摄的低质量照片）时，模型输出的可靠性。我们探索了元分类模型的使用方法，以实证评估为车身部件语义分割而训练的模型所预测的分割精确度。我们比较了与片段质量相关的不同特征集，在区分高质量和低质量片段方面，AUROC 得分为 0.915。通过去除低质量片段，分割输出的平均 (m{textit{IoU}} ) 分数提高了 16 个百分点，错误预测的片段数量减少了 77%。

引用次数: 0

Thermal infrared action recognition with two-stream shift Graph Convolutional Network 利用双流移位图卷积网络识别热红外动作

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-13 DOI: 10.1007/s00138-024-01550-2

Jishi Liu, Huanyu Wang, Junnian Wang, Dalin He, Ruihan Xu, Xiongfeng Tang

The extensive deployment of camera-based IoT devices in our society is heightening the vulnerability of citizens’ sensitive information and individual data privacy. In this context, thermal imaging techniques become essential for data desensitization, entailing the elimination of sensitive data to safeguard individual privacy. Meanwhile, thermal imaging techniques can also play a important role in industry by considering the industrial environment with low resolution, high noise and unclear objects’ features. Moreover, existing works often process the entire video as a single entity, which results in suboptimal robustness by overlooking individual actions occurring at different times. In this paper, we propose a lightweight algorithm for action recognition in thermal infrared videos using human skeletons to address this. Our approach includes YOLOv7-tiny for target detection, Alphapose for pose estimation, dynamic skeleton modeling, and Graph Convolutional Networks (GCN) for spatial-temporal feature extraction in action prediction. To overcome detection and pose challenges, we created OQ35-human and OQ35-keypoint datasets for training. Besides, the proposed model enhances robustness by using visible spectrum data for GCN training. Furthermore, we introduce the two-stream shift Graph Convolutional Network to improve the action recognition accuracy. Our experimental results on the custom thermal infrared action dataset (InfAR-skeleton) demonstrate Top-1 accuracy of 88.06% and Top-5 accuracy of 98.28%. On the filtered kinetics-skeleton dataset, the algorithm achieves Top-1 accuracy of 55.26% and Top-5 accuracy of 83.98%. Thermal Infrared Action Recognition ensures the protection of individual privacy while meeting the requirements of action recognition.

随着基于摄像头的物联网设备在社会中的广泛部署，公民的敏感信息和个人数据隐私变得更加脆弱。在这种情况下，红外热成像技术成为数据脱敏的必要手段，即消除敏感数据以保护个人隐私。同时，考虑到工业环境分辨率低、噪声大、物体特征不清晰等问题，热成像技术在工业领域也能发挥重要作用。此外，现有的研究通常将整个视频作为一个整体进行处理，从而忽略了不同时间发生的个别动作，导致鲁棒性不理想。针对这一问题，我们在本文中提出了一种利用人体骨骼在热红外视频中进行动作识别的轻量级算法。我们的方法包括用于目标检测的 YOLOv7-tiny、用于姿势估计的 Alphapose、动态骨架建模以及用于动作预测的时空特征提取的图卷积网络（GCN）。为了克服检测和姿势方面的挑战，我们创建了 OQ35-human 和 OQ35-keypoint 数据集进行训练。此外，通过使用可见光谱数据进行 GCN 训练，所提出的模型增强了鲁棒性。此外，我们还引入了双流移位图卷积网络，以提高动作识别的准确性。我们在定制热红外动作数据集（InfAR-skeleton）上的实验结果表明，Top-1 准确率为 88.06%，Top-5 准确率为 98.28%。在过滤动力学骨架数据集上，该算法的 Top-1 准确率为 55.26%，Top-5 准确率为 83.98%。热红外动作识别技术在满足动作识别要求的同时，确保了对个人隐私的保护。

{"title":"Thermal infrared action recognition with two-stream shift Graph Convolutional Network","authors":"Jishi Liu, Huanyu Wang, Junnian Wang, Dalin He, Ruihan Xu, Xiongfeng Tang","doi":"10.1007/s00138-024-01550-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01550-2","url":null,"abstract":"The extensive deployment of camera-based IoT devices in our society is heightening the vulnerability of citizens’ sensitive information and individual data privacy. In this context, thermal imaging techniques become essential for data desensitization, entailing the elimination of sensitive data to safeguard individual privacy. Meanwhile, thermal imaging techniques can also play a important role in industry by considering the industrial environment with low resolution, high noise and unclear objects’ features. Moreover, existing works often process the entire video as a single entity, which results in suboptimal robustness by overlooking individual actions occurring at different times. In this paper, we propose a lightweight algorithm for action recognition in thermal infrared videos using human skeletons to address this. Our approach includes YOLOv7-tiny for target detection, Alphapose for pose estimation, dynamic skeleton modeling, and Graph Convolutional Networks (GCN) for spatial-temporal feature extraction in action prediction. To overcome detection and pose challenges, we created OQ35-human and OQ35-keypoint datasets for training. Besides, the proposed model enhances robustness by using visible spectrum data for GCN training. Furthermore, we introduce the two-stream shift Graph Convolutional Network to improve the action recognition accuracy. Our experimental results on the custom thermal infrared action dataset (InfAR-skeleton) demonstrate Top-1 accuracy of 88.06% and Top-5 accuracy of 98.28%. On the filtered kinetics-skeleton dataset, the algorithm achieves Top-1 accuracy of 55.26% and Top-5 accuracy of 83.98%. Thermal Infrared Action Recognition ensures the protection of individual privacy while meeting the requirements of action recognition.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"17 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi sentence description of complex manipulation action videos 复杂操作动作视频的多句描述

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-09 DOI: 10.1007/s00138-024-01547-x

Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter

Automatic video description necessitates generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single, fixed-level detail sentences, often overlook. This work delves into video descriptions of manipulation actions, where varying levels of detail are crucial to conveying information about the hierarchical structure of actions, also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method, aiming to amalgamate the benefits of both the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data availability scenarios. All three frameworks utilize LSTM stacks to facilitate description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. Quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.

自动视频描述需要生成能概括视频中的动作、事件和对象的自然语言语句。人类描述视频的一项基本能力是改变细节层次，而现有的自动视频描述方法通常会生成单一、固定层次的细节句子，往往会忽略这一特点。这项研究深入探讨了操作动作的视频描述，在视频描述中，不同层次的细节对于传达动作的层次结构信息至关重要，这也与当代的机器人学习技术息息相关。我们最初提出了两个框架：混合统计模型和端到端方法。混合方法需要的数据量要少得多，它对视频片段中的不确定性进行统计建模。相反，端到端方法则需要更多数据，它绕过任何统计处理，在视觉编码器和语言解码器之间建立直接联系。此外，我们还引入了一种集成方法，旨在综合混合统计方法和端到端方法的优势，提高视频描述在不同数据可用性场景下的适应性和深度。所有这三种框架都利用 LSTM 堆栈来促进描述粒度，允许通过简洁的单句或精致的多句叙述来描述视频。定量结果表明，与其他竞争方法相比，这些方法能生成更逼真的描述。

{"title":"Multi sentence description of complex manipulation action videos","authors":"Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija Tamosiunaite, Florentin Wörgötter","doi":"10.1007/s00138-024-01547-x","DOIUrl":"https://doi.org/10.1007/s00138-024-01547-x","url":null,"abstract":"Automatic video description necessitates generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single, fixed-level detail sentences, often overlook. This work delves into video descriptions of manipulation actions, where varying levels of detail are crucial to conveying information about the hierarchical structure of actions, also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method, aiming to amalgamate the benefits of both the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data availability scenarios. All three frameworks utilize LSTM stacks to facilitate description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. Quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"43 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Lightweight segmentation algorithm of feasible area and targets of unmanned surface cleaning vessels 无人水面清洁船可行区域和目标的轻量级分割算法

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-08 DOI: 10.1007/s00138-024-01537-z

Jingfu Shen, Yuanliang Zhang, Feiyue Liu, Chun Liu

To achieve real-time segmentation with accurate delineation for feasible areas and target recognition in Unmanned Surface Cleaning Vessel (USCV) image processing, a segmentation approach leveraging visual sensors on USCVs was developed. Initial data collection was executed with remote-controlled cleaning vessels, followed by data cleansing, image deduplication, and manual selection. This led to the creation of WaterSeg dataset, tailored for segmentation tasks in USCV contexts. Upon comparing various deep learning-driven semantic segmentation techniques, a novel, efficient Muti-Cascade Semantic Segmentation Network (MCSSNet) emerged. Comprehensive tests demonstrated that, relative to the state of the art, MCSSNet achieved an average accuracy of 90.64%, a segmentation speed of 44.55fps, and a 45% reduction in model parameters.

为了在无人水面清洁船（USCV）图像处理中实现实时分割，准确划分可行区域和识别目标，开发了一种利用 USCV 上视觉传感器的分割方法。最初的数据收集是通过遥控清洁船进行的，随后进行了数据清理、图像重复数据删除和人工选择。由此创建了 WaterSeg 数据集，专为 USCV 环境下的分割任务定制。在对各种深度学习驱动的语义分割技术进行比较后，一种新颖、高效的多级联语义分割网络（Muti-Cascade Semantic Segmentation Network，MCSSNet）应运而生。综合测试表明，与现有技术相比，MCSSNet 的平均准确率达到了 90.64%，分割速度为 44.55fps，模型参数减少了 45%。

引用次数: 0

CAMTrack: a combined appearance-motion method for multiple-object tracking CAMTrack：用于多目标跟踪的外观与运动相结合的方法

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-07 DOI: 10.1007/s00138-024-01548-w

Duy Cuong Bui, Ngan Linh Nguyen, Anh Hiep Hoang, Myungsik Yoo

Object tracking has emerged as an essential process for various applications in the field of computer vision, such as autonomous driving. Recently, object tracking technology has experienced rapid growth, particularly its applications in self-driving vehicles. Tracking systems typically follow the detection-based tracking paradigm, which is affected by the detection results. Although deep learning has led to significant improvements in object detection, data association remains dependent on factors such as spatial location, motion, and appearance, to associate new observations with existing tracks. In this study, we introduce a novel approach called Combined Appearance-Motion Tracking (CAMTrack) to enhance data association by integrating object appearances and their corresponding movements. The proposed tracking method utilizes an appearance-motion model using an appearance-affinity network and an Interactive Multiple Model (IMM). We deploy the appearance model to address the visual affinity between objects across frames and employed the motion model to incorporate motion constraints to obtain robust position predictions under maneuvering movements. Moreover, we also propose a Two-phase association algorithm which is an effective way to recover lost tracks back from previous frames. CAMTrack was evaluated on the widely recognized object tracking benchmarks-KITTI and MOT17. The results showed the superior performance of the proposed method, highlighting its potential to contribute to advances in object tracking.

物体跟踪已成为计算机视觉领域各种应用（如自动驾驶）的必要过程。最近，物体跟踪技术经历了快速发展，尤其是在自动驾驶汽车中的应用。跟踪系统通常遵循基于检测的跟踪范式，这受到检测结果的影响。虽然深度学习已在物体检测方面取得了重大改进，但数据关联仍然依赖于空间位置、运动和外观等因素，以便将新的观测结果与现有轨迹关联起来。在本研究中，我们引入了一种名为 "组合外观-运动跟踪（CAMTrack）"的新方法，通过整合物体外观及其相应的运动来增强数据关联。所提出的跟踪方法利用了一个外观-运动模型，该模型使用了一个外观-亲和网络和一个交互式多重模型（IMM）。我们利用外观模型来解决跨帧物体之间的视觉亲和性问题，并利用运动模型纳入运动约束，从而在机动运动中获得稳健的位置预测。此外，我们还提出了一种两阶段关联算法，这是一种从先前帧中恢复丢失轨迹的有效方法。CAMTrack 在广泛认可的物体跟踪基准--KITTI 和 MOT17 上进行了评估。结果表明，所提出的方法性能优越，凸显了其在促进物体追踪进步方面的潜力。

{"title":"CAMTrack: a combined appearance-motion method for multiple-object tracking","authors":"Duy Cuong Bui, Ngan Linh Nguyen, Anh Hiep Hoang, Myungsik Yoo","doi":"10.1007/s00138-024-01548-w","DOIUrl":"https://doi.org/10.1007/s00138-024-01548-w","url":null,"abstract":"Object tracking has emerged as an essential process for various applications in the field of computer vision, such as autonomous driving. Recently, object tracking technology has experienced rapid growth, particularly its applications in self-driving vehicles. Tracking systems typically follow the detection-based tracking paradigm, which is affected by the detection results. Although deep learning has led to significant improvements in object detection, data association remains dependent on factors such as spatial location, motion, and appearance, to associate new observations with existing tracks. In this study, we introduce a novel approach called Combined Appearance-Motion Tracking (CAMTrack) to enhance data association by integrating object appearances and their corresponding movements. The proposed tracking method utilizes an appearance-motion model using an appearance-affinity network and an Interactive Multiple Model (IMM). We deploy the appearance model to address the visual affinity between objects across frames and employed the motion model to incorporate motion constraints to obtain robust position predictions under maneuvering movements. Moreover, we also propose a Two-phase association algorithm which is an effective way to recover lost tracks back from previous frames. CAMTrack was evaluated on the widely recognized object tracking benchmarks-KITTI and MOT17. The results showed the superior performance of the proposed method, highlighting its potential to contribute to advances in object tracking.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"122 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140889402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Residual feature learning with hierarchical calibration for gaze estimation 残差特征学习与分层校准，用于凝视估计

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-05 DOI: 10.1007/s00138-024-01545-z

Zhengdan Yin, Sanping Zhou, Le Wang, Tao Dai, Gang Hua, Nanning Zheng

Gaze estimation aims to predict accurate gaze direction from natural eye images, which is an extreme challenging task due to both random variations in head pose and person-specific biases. Existing works often independently learn features from binocular images and directly concatenate them for gaze estimation. In this paper, we propose a simple yet effective two-stage framework for gaze estimation, in which both residual feature learning (RFL) and hierarchical gaze calibration (HGC) networks are designed to consistently improve the performance of gaze estimation. Specifically, the RFL network extracts informative features by jointly exploring the symmetric and asymmetric factors between left and right eyes, which can produce accurate initial predictions as much as possible. Besides, the HGC network cascades a personal-specific transform module to further transform the distribution of gaze point from coarse to fine, which can effectively compensate the subjective bias in initial predictions. Extensive experiments on both EVE and MPIIGaze datasets show that our method outperforms the state-of-the-art approaches.

注视估计旨在从自然眼球图像中预测准确的注视方向，由于头部姿势的随机变化和特定人的偏差，这是一项极具挑战性的任务。现有的工作通常是独立地从双目图像中学习特征，然后直接串联起来进行注视估计。在本文中，我们提出了一个简单而有效的两阶段凝视估计框架，其中残差特征学习（RFL）和分层凝视校准（HGC）网络的设计能持续提高凝视估计的性能。具体来说，残差特征学习网络通过联合探索左右眼的对称和不对称因素来提取信息特征，从而尽可能产生准确的初始预测。此外，HGC 网络级联了个人特定的变换模块，进一步对注视点的分布进行由粗到细的变换，从而有效弥补了初始预测的主观偏差。在 EVE 和 MPIIGaze 数据集上进行的大量实验表明，我们的方法优于最先进的方法。

{"title":"Residual feature learning with hierarchical calibration for gaze estimation","authors":"Zhengdan Yin, Sanping Zhou, Le Wang, Tao Dai, Gang Hua, Nanning Zheng","doi":"10.1007/s00138-024-01545-z","DOIUrl":"https://doi.org/10.1007/s00138-024-01545-z","url":null,"abstract":"Gaze estimation aims to predict accurate gaze direction from natural eye images, which is an extreme challenging task due to both random variations in head pose and person-specific biases. Existing works often independently learn features from binocular images and directly concatenate them for gaze estimation. In this paper, we propose a simple yet effective two-stage framework for gaze estimation, in which both residual feature learning (RFL) and hierarchical gaze calibration (HGC) networks are designed to consistently improve the performance of gaze estimation. Specifically, the RFL network extracts informative features by jointly exploring the symmetric and asymmetric factors between left and right eyes, which can produce accurate initial predictions as much as possible. Besides, the HGC network cascades a personal-specific transform module to further transform the distribution of gaze point from coarse to fine, which can effectively compensate the subjective bias in initial predictions. Extensive experiments on both EVE and MPIIGaze datasets show that our method outperforms the state-of-the-art approaches.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"2 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140889416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A tree-based approach for visible and thermal sensor fusion in winter autonomous driving 基于树的冬季自动驾驶可见光和热传感器融合方法

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-05-03 DOI: 10.1007/s00138-024-01546-y

Jonathan Boisclair, Ali Amamou, Sousso Kelouwani, M. Zeshan Alam, Hedi Oueslati, Lotfi Zeghmi, Kodjo Agbossou

Research on autonomous vehicles has been at a peak recently. One of the most researched aspects is the performance degradation of sensors in harsh weather conditions such as rain, snow, fog, and hail. This work addresses this performance degradation by fusing multiple sensor modalities inside the neural network used for detection. The proposed fusion method removes the pre-process fusion stage. It directly produces detection boxes from numerous images. It reduces the computation cost by providing detection and fusion simultaneously. By separating the network during the initial layers, the network can easily be modified for new sensors. Intra-network fusion improves robustness to missing inputs and applies to all compatible types of inputs while reducing the peak computing cost by using a valley-fill algorithm. Our experiments demonstrate that adopting a parallel multimodal network to fuse thermal images in the network improves object detection during difficult weather conditions such as harsh winters by up to 5% mAP while reducing dataset bias during complicated weather conditions. It also happens with around 50% fewer parameters than late-fusion approaches, which duplicate the whole network instead of the first section of the feature extractor.

最近，有关自动驾驶汽车的研究进入了高峰期。研究最多的一个方面是传感器在雨、雪、雾和冰雹等恶劣天气条件下的性能下降。这项研究通过在用于检测的神经网络中融合多种传感器模式来解决性能下降问题。所提出的融合方法取消了预处理融合阶段。它直接从众多图像中生成检测框。它通过同时提供检测和融合功能来降低计算成本。通过在初始层分离网络，网络可以很容易地针对新的传感器进行修改。网络内融合提高了对缺失输入的鲁棒性，并适用于所有兼容的输入类型，同时通过使用填谷算法降低了峰值计算成本。我们的实验证明，采用并行多模态网络将热图像融合到网络中，可在严冬等恶劣天气条件下提高物体检测率高达 5%，同时减少复杂天气条件下的数据集偏差。与复制整个网络而不是特征提取器第一部分的后期融合方法相比，这种方法的参数也减少了约 50%。

{"title":"A tree-based approach for visible and thermal sensor fusion in winter autonomous driving","authors":"Jonathan Boisclair, Ali Amamou, Sousso Kelouwani, M. Zeshan Alam, Hedi Oueslati, Lotfi Zeghmi, Kodjo Agbossou","doi":"10.1007/s00138-024-01546-y","DOIUrl":"https://doi.org/10.1007/s00138-024-01546-y","url":null,"abstract":"Research on autonomous vehicles has been at a peak recently. One of the most researched aspects is the performance degradation of sensors in harsh weather conditions such as rain, snow, fog, and hail. This work addresses this performance degradation by fusing multiple sensor modalities inside the neural network used for detection. The proposed fusion method removes the pre-process fusion stage. It directly produces detection boxes from numerous images. It reduces the computation cost by providing detection and fusion simultaneously. By separating the network during the initial layers, the network can easily be modified for new sensors. Intra-network fusion improves robustness to missing inputs and applies to all compatible types of inputs while reducing the peak computing cost by using a valley-fill algorithm. Our experiments demonstrate that adopting a parallel multimodal network to fuse thermal images in the network improves object detection during difficult weather conditions such as harsh winters by up to 5% mAP while reducing dataset bias during complicated weather conditions. It also happens with around 50% fewer parameters than late-fusion approaches, which duplicate the whole network instead of the first section of the feature extractor.\u0000","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"4 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140889417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Robust semantic segmentation method of urban scenes in snowy environment 雪地环境中城市场景的稳健语义分割方法

IF 3.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-04-29 DOI: 10.1007/s00138-024-01540-4

Hanqi Yin, Guisheng Yin, Yiming Sun, Liguo Zhang, Ye Tian

Semantic segmentation plays a crucial role in various computer vision tasks, such as autonomous driving in urban scenes. The related researches have made significant progress. However, since most of the researches focus on how to enhance the performance of semantic segmentation models, there is a noticeable lack of attention given to the performance deterioration of these models in severe weather. To address this issue, we study the robustness of the multimodal semantic segmentation model in snowy environment, which represents a subset of severe weather conditions. The proposed method generates realistically simulated snowy environment images by combining unpaired image translation with adversarial snowflake generation, effectively misleading the segmentation model’s predictions. These generated adversarial images are then utilized for model robustness learning, enabling the model to adapt to the harshest snowy environment and enhancing its robustness to artificially adversarial perturbance to some extent. The experimental visualization results show that the proposed method can generate approximately realistic snowy environment images, and yield satisfactory visual effects for both daytime and nighttime scenes. Moreover, the experimental quantitation results generated by MFNet Dataset indicate that compared with the model without enhancement, the proposed method achieves average improvements of 4.82% and 3.95% on mAcc and mIoU, respectively. These improvements enhance the adaptability of the multimodal semantic segmentation model to snowy environments and contribute to road safety. Furthermore, the proposed method demonstrates excellent applicability, as it can be seamlessly integrated into various multimodal semantic segmentation models.

语义分割在各种计算机视觉任务（如城市场景中的自动驾驶）中发挥着至关重要的作用。相关研究已取得重大进展。然而，由于大多数研究都集中在如何提高语义分割模型的性能上，因此明显缺乏对这些模型在恶劣天气下性能下降问题的关注。为了解决这个问题，我们研究了多模态语义分割模型在雪地环境中的鲁棒性，雪地环境是恶劣天气条件的一个子集。所提出的方法通过将非配对图像翻译与对抗雪花生成相结合，生成真实的模拟雪地环境图像，有效地误导了分割模型的预测。然后利用这些生成的对抗图像进行模型鲁棒性学习，使模型能够适应最恶劣的雪地环境，并在一定程度上增强其对人为对抗扰动的鲁棒性。可视化实验结果表明，所提出的方法可以生成近似真实的雪地环境图像，并在白天和夜间场景中都能产生令人满意的视觉效果。此外，由 MFNet 数据集生成的实验量化结果表明，与未增强的模型相比，提出的方法在 mAcc 和 mIoU 上分别实现了 4.82% 和 3.95% 的平均改进。这些改进提高了多模态语义分割模型对雪地环境的适应性，有助于道路安全。此外，所提出的方法还可以无缝集成到各种多模态语义分割模型中，因此具有出色的适用性。

{"title":"Robust semantic segmentation method of urban scenes in snowy environment","authors":"Hanqi Yin, Guisheng Yin, Yiming Sun, Liguo Zhang, Ye Tian","doi":"10.1007/s00138-024-01540-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01540-4","url":null,"abstract":"Semantic segmentation plays a crucial role in various computer vision tasks, such as autonomous driving in urban scenes. The related researches have made significant progress. However, since most of the researches focus on how to enhance the performance of semantic segmentation models, there is a noticeable lack of attention given to the performance deterioration of these models in severe weather. To address this issue, we study the robustness of the multimodal semantic segmentation model in snowy environment, which represents a subset of severe weather conditions. The proposed method generates realistically simulated snowy environment images by combining unpaired image translation with adversarial snowflake generation, effectively misleading the segmentation model’s predictions. These generated adversarial images are then utilized for model robustness learning, enabling the model to adapt to the harshest snowy environment and enhancing its robustness to artificially adversarial perturbance to some extent. The experimental visualization results show that the proposed method can generate approximately realistic snowy environment images, and yield satisfactory visual effects for both daytime and nighttime scenes. Moreover, the experimental quantitation results generated by MFNet Dataset indicate that compared with the model without enhancement, the proposed method achieves average improvements of 4.82% and 3.95% on mAcc and mIoU, respectively. These improvements enhance the adaptability of the multimodal semantic segmentation model to snowy environments and contribute to road safety. Furthermore, the proposed method demonstrates excellent applicability, as it can be seamlessly integrated into various multimodal semantic segmentation models.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"8 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140827644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0