Machine Vision and Applications: latest articles (page 2)

Temporal superimposed crossover module for effective continuous sign language

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-19 DOI: 10.1007/s00138-024-01595-3

Qidan Zhu, Jing Li, Fei Yuan, Quan Gan

The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between sign language users and non-signers, which places high demands on the real-time performance and deployability of the model. However, previous CSLR studies have paid little attention to these two properties. In this paper, we propose a novel CSLR model, ResNetT, based on a temporal superposition crossover module (TSCM) and ResNet. It replaces parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters or the amount of computation. ResNetT improves the real-time performance and deployability of the model while preserving its accuracy. The core is our proposed zero-parameter, zero-computation module TSCM. We combine TSCM with 2D convolution to form the "TSCM+2D" hybrid convolution, which provides powerful spatial-temporal modeling capability with no increase in parameters and a lower deployment cost than other spatial-temporal convolutions. Further, we apply "TSCM+2D" to ResBlock to form the new ResBlockT, which is the basis of ResNetT. We introduce stochastic gradient stops and a multilevel connectionist temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final word error rate (WER), and extends ResNet from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.
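
The idea of replacing parameterized temporal computation with shifts along the time axis can be illustrated with a minimal sketch. This is a generic channel-wise temporal shift, not the authors' TSCM, whose exact design is not given in the abstract; the function name `temporal_shift` and the parameter `fold_div` are hypothetical. It shows that moving channels between neighbouring frames mixes temporal information without any learned weights or multiplications.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels one step along the time axis.

    frames: list of T frames, each a list of C channel values.
    The first C // fold_div channels read from the previous frame,
    the next C // fold_div channels read from the following frame,
    and the remaining channels stay in place. Positions without a
    neighbouring frame are zero-filled.
    (Illustrative only; not the TSCM described in the paper.)
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:          # channel group taken from frame t - 1
                src = t - 1
            elif c < 2 * fold:    # channel group taken from frame t + 1
                src = t + 1
            else:                 # channels left unshifted
                src = t
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(temporal_shift(frames))
```

After such a shift, an ordinary 2D convolution applied per frame sees values from adjacent frames in some of its input channels, which is the general reason a shift followed by 2D convolution can model time at no extra parameter cost.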

Citations: 0

Dyna-MSDepth: multi-scale self-supervised monocular depth estimation network for visual SLAM in dynamic scenes

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-19 DOI: 10.1007/s00138-024-01586-4

Jianjun Yao, Yingzhao Li, Jiajia Li

Monocular Simultaneous Localization And Mapping (SLAM) suffers from scale drift, leading to tracking failure due to scale ambiguity. Deep learning has significantly advanced self-supervised monocular depth estimation, enabling scale drift reduction. Nonetheless, current self-supervised learning approaches fail to provide scale-consistent depth maps, estimate depth in dynamic environments, or perceive multi-scale information. In response to these limitations, this paper proposes Dyna-MSDepth, a novel method for estimating multi-scale, stable, and reliable depth maps in dynamic environments. Dyna-MSDepth incorporates multi-scale high-order spatial semantic interaction into self-supervised training. This integration enhances the model’s capacity to discern intricate texture nuances and distant depth cues. Dyna-MSDepth is evaluated on challenging dynamic datasets, including KITTI, TUM, BONN, and DDAD, employing rigorous qualitative evaluations and quantitative experiments. Furthermore, the accuracy of the depth maps estimated by Dyna-MSDepth is assessed in monocular SLAM. Extensive experiments confirm the superior multi-scale depth estimation capabilities of Dyna-MSDepth, highlighting its significant value in dynamic environments. Code is available at https://github.com/Pepper-FlavoredChewingGum/Dyna-MSDepth.

Citations: 0

Cmf-transformer: cross-modal fusion transformer for human action recognition

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-17 DOI: 10.1007/s00138-024-01598-0

Jun Wang, Limin Xia, Xin Wen

In human action recognition, spatio-temporal video features and skeleton features can each achieve good recognition performance on their own; how to combine these two modalities to achieve better performance is still a worthwhile research direction. To better combine the two modalities, we propose a novel cross-modal fusion transformer for human action recognition, CMF-Transformer, which effectively fuses the two modalities. In the spatio-temporal modality, video frames are used as inputs and directional attention is used in the transformer to obtain the order of recognition between different spatio-temporal blocks. In the skeleton joint modality, skeleton joints are used as inputs, and spatio-temporal cross-attention in the transformer explores more complete correlations among different skeleton joints. Subsequently, a multimodal collaborative recognition strategy identifies the respective features and the connectivity features of the two modalities separately, then weights the identification results to recognize the target action by fusing the features of the two modalities. A series of experiments on three benchmark datasets demonstrates that CMF-Transformer outperforms most current state-of-the-art methods.
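
The weighting of per-modality recognition results can be sketched as a simple late fusion of class scores. The abstract does not specify how the weights are chosen or applied, so this is only an assumed form: `fuse_scores` and `video_weight` are hypothetical names, and a single fixed scalar weight stands in for whatever the paper's collaborative strategy actually uses.

```python
def fuse_scores(video_scores, skeleton_scores, video_weight=0.5):
    """Blend per-class scores from two modalities and pick the top class.

    video_scores, skeleton_scores: per-class scores of equal length.
    video_weight: assumed fixed weight in [0, 1] for the video branch;
    the skeleton branch receives 1 - video_weight.
    Returns (fused_scores, index_of_best_class).
    """
    if len(video_scores) != len(skeleton_scores):
        raise ValueError("both modalities must score the same classes")
    skeleton_weight = 1.0 - video_weight
    fused = [
        video_weight * v + skeleton_weight * s
        for v, s in zip(video_scores, skeleton_scores)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best


fused, best = fuse_scores([0.6, 0.4], [0.2, 0.8], video_weight=0.5)
print(fused, best)
```

In this toy case the video branch alone would pick class 0, while the fused scores favour class 1 because the skeleton branch is more confident.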

Citations: 0

An efficient driving behavior prediction approach using physiological auxiliary and adaptive LSTM

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-14 DOI: 10.1007/s00138-024-01600-9

Jun Gao, Jiangang Yi, Yi Lu Murphey

Driving behavior prediction is crucial in designing a modern advanced driver assistance system (ADAS). Such predictions can improve driving safety by alerting the driver to unsafe or risky traffic situations. In this research, an efficient approach, the Driver Behavior Network (DBNet), is proposed for driving behavior prediction using multi-modal data, i.e., front-view video frames and driver physiological signals. First, a relation-guided spatial attention (RGSA) module is adopted to generate driving scene-centric features by modeling both local and global information from video frames. Second, a new global shrinkage (GS) block is designed that incorporates soft thresholding as a nonlinear transformation layer to generate physiological features and eliminate noise-related information from physiological signals. Finally, a customized adaptive focal loss based long short-term memory (AFL-LSTM) network is introduced to learn the multi-modal features and capture the dependencies within driving behaviors simultaneously. We applied our approach to real data collected in an instrumented vehicle during drives in both urban and freeway environments. The experimental findings demonstrate that DBNet can predict upcoming driving behavior efficiently and significantly outperforms other state-of-the-art models.
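
Soft thresholding, the nonlinearity named for the GS block, has a standard definition: values with magnitude at most a threshold become zero, and larger values are shrunk toward zero by that threshold. The sketch below shows only this operator. How the GS block obtains its threshold is not stated in the abstract, so the fixed parameter `tau` and the function name `soft_threshold` are assumptions for illustration.

```python
def soft_threshold(values, tau):
    """Apply soft thresholding: sign(x) * max(|x| - tau, 0).

    values: iterable of floats (for example, one channel of a signal).
    tau: non-negative threshold, assumed fixed here for illustration.
    """
    shrunk = []
    for x in values:
        magnitude = abs(x) - tau
        if magnitude <= 0:
            shrunk.append(0.0)       # small values are treated as noise
        elif x > 0:
            shrunk.append(magnitude)
        else:
            shrunk.append(-magnitude)
    return shrunk


print(soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], tau=1.0))
```

Small fluctuations around zero are removed while large deviations keep their sign, which is why the operator is commonly used for denoising.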

Citations: 0

Robust visual-based method and new datasets for ego-lane index estimation in urban environment

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-14 DOI: 10.1007/s00138-024-01590-8

Dianzheng Wang, Dongyi Liang, Shaomiao Li

Correct and robust ego-lane index estimation is crucial for autonomous driving in the absence of high-definition maps, especially in urban environments. Previous ego-lane index estimation approaches rely on feature extraction, which limits their robustness. To overcome these shortcomings, this study proposes a robust ego-lane index estimation framework based only on the original visual image. After optimization of the processing route, the raw image is randomly cropped in the height direction and then input into a doubly supervised LaneLoc network to obtain the index estimates and confidences. A post-processing step is also proposed to derive the global ego-lane index from the estimated left and right indexes together with the total lane number. To evaluate the proposed method, we manually annotated the ego-lane index of public datasets, which for the first time can serve as an ego-lane index estimation baseline. The proposed algorithm achieved 96.48%/95.40% (precision/recall) on the CULane dataset and 99.45%/99.49% (precision/recall) on the TuSimple dataset, demonstrating the effectiveness and efficiency of lane localization in diverse driving environments. The code and dataset annotation results will be made publicly available at https://github.com/haomo-ai/LaneLoc.
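
For readers unfamiliar with the reported metrics, precision and recall for a single lane index can be computed from predicted and annotated labels as in the sketch below. How the paper aggregates over lane indexes is not stated in the abstract, so the function `precision_recall` and the one-class-at-a-time treatment are assumptions.

```python
def precision_recall(predicted, actual, positive):
    """Compute precision and recall for one class label.

    predicted, actual: equal-length sequences of lane-index labels.
    positive: the label treated as the class of interest.
    Returns (precision, recall); a ratio with a zero denominator is
    reported as 0.0 by convention in this sketch.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


print(precision_recall([1, 1, 2, 1, 3], [1, 2, 2, 1, 1], positive=1))
```

Precision penalizes frames wrongly assigned to the lane, while recall penalizes frames of that lane that were missed.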

Citations: 0

MFFAE-Net: semantic segmentation of point clouds using multi-scale feature fusion and attention enhancement networks

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-12 DOI: 10.1007/s00138-024-01589-1

Wei Liu, Yisheng Lu, Tao Zhang

Point cloud data can reflect more information about real 3D space and has gained increasing attention in the computer vision field. However, the unstructured and unordered nature of point clouds poses many challenges for their study. How to learn global features directly from the original point cloud is a long-standing problem in this research. In work based on the encoder-decoder structure, many researchers focus on designing the encoder to better extract features and do not further explore more globally representative features from the encoder and decoder features. To solve this problem, we propose the MFFAE-Net method, which aims to obtain more globally representative point cloud features by using the feature learning of the encoder-decoder stage. Our method first enhances the feature information of the input point cloud by merging the information of its neighboring points, which helps the subsequent point cloud feature extraction. Second, a channel attention module further processes the extracted features to highlight the role of important channels. Finally, we fuse features of different scales, as well as features of the same scale, from the encoding and decoding features to obtain more global point cloud features, which helps improve point cloud segmentation results. Experimental results show that the method performs well on some objects in the S3DIS dataset and the Toronto3d dataset.

Citations: 0

Adversarial imitation learning-based network for category-level 6D object pose estimation

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-12 DOI: 10.1007/s00138-024-01592-6

Shantong Sun, Xu Bao, Aryan Kaushik

Category-level 6D object pose estimation is a fundamental and important research topic in computer vision. To remove the dependence on object 3D models, analysis-by-synthesis object pose estimation methods have recently been widely studied. While these methods improve generalization to some extent, the accuracy of category-level object pose estimation still needs to be improved. In this paper, we propose a category-level 6D object pose estimation network based on adversarial imitation learning, named AIL-Net. AIL-Net adopts the state-action distribution matching criterion and is able to perform expert actions that have not appeared in the dataset. This prevents the object pose estimation from falling into a bad state. We further design a framework for estimating object pose through generative adversarial imitation learning. This method is able to distinguish between expert policy and imitation policy in AIL-Net. Experimental results show that our approach achieves competitive category-level object pose estimation performance on the REAL275 dataset and the Cars dataset.

Citations: 0

Active perception based on deep reinforcement learning for autonomous robotic damage inspection

IF 3.3; Zone 4 (Computer Science); Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Machine Vision and Applications

Pub Date : 2024-08-12 DOI: 10.1007/s00138-024-01591-7

Wen Tang, Mohammad R. Jahanshahi

In this study, an artificial intelligence framework is developed to facilitate the use of robotics for autonomous damage inspection. While considerable progress has been achieved by utilizing state-of-the-art computer vision approaches for damage detection, these approaches are still far from being usable in autonomous robotic inspection systems because of the uncertainties in data collection and data interpretation. To address this gap, this study proposes a framework that enables robots to select the best course of action for active damage perception and reduction of uncertainties. By doing so, the required information is collected efficiently for a better understanding of damage severity, which leads to reliable decision-making. More specifically, the active damage perception task is formulated as a Partially Observable Markov Decision Process, and a deep reinforcement learning-based active perception agent is proposed to learn a near-optimal policy for this task. The proposed framework is evaluated for the autonomous assessment of cracks on metallic surfaces of an underwater nuclear reactor. Active perception exhibits a notable enhancement in crack Intersection over Union (IoU) performance, yielding an increase of up to 69% compared with its raster scanning counterpart given a similar inspection time. Additionally, the proposed method can perform a rapid inspection that reduces the overall inspection time by more than a factor of two while achieving a 15% higher crack IoU than the dense raster scanning approach.
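
Intersection over Union, the metric used above to compare crack detections, is the ratio of pixels marked as crack in both the prediction and the ground truth to pixels marked in either. A minimal sketch for binary masks follows; the function name `mask_iou` and the choice to return 1.0 when both masks are empty are assumptions of this example, not details from the paper.

```python
def mask_iou(predicted, truth):
    """Compute IoU between two binary masks of identical shape.

    predicted, truth: lists of rows, each row a list of 0/1 values.
    Returns intersection / union, or 1.0 if both masks are empty
    (a convention assumed here).
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


predicted = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(mask_iou(predicted, truth))
```

Here two pixels are marked in both masks and four pixels are marked in at least one, so the IoU is 0.5.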

本研究开发了一个人工智能框架，以促进机器人技术在自主损伤检测中的应用。虽然利用最先进的计算机视觉方法进行损伤检测已经取得了相当大的进展，但由于数据收集和数据解释方面的不确定性，这些方法距离用于自主机器人检测系统还很遥远。为了弥补这一差距，本研究提出了一个框架，使机器人能够选择最佳行动方案，主动感知损伤并减少不确定性。这样就能有效收集所需信息，更好地了解损坏严重程度，从而做出可靠的决策。更具体地说，主动损伤感知任务被表述为部分可观测马尔可夫决策过程，并提出了一种基于深度强化学习的主动感知代理，以学习该任务的近优策略。针对水下核反应堆金属表面裂缝的自主评估，对所提出的框架进行了评估。主动感知显著提高了裂纹交集（IoU）性能，在检测时间相近的情况下，比光栅扫描提高了 69%。此外，所提出的方法还能进行快速检测，将整体检测时间缩短两倍以上，同时裂纹 IoU 比密集光栅扫描方法高出 15%。

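The crack Intersection over Union (IoU) figures quoted in the abstract compare a predicted crack mask with a ground-truth mask. As a minimal sketch of that metric only (not the authors' evaluation code), the function below computes IoU for two binary masks. The name `mask_iou`, the nested-list mask format, and the empty-mask convention are assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0  # pixels marked as crack in both masks
    union = 0  # pixels marked as crack in at least one mask
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Hypothetical 2x4 masks: 2 pixels overlap, 4 pixels are covered in total.
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0]]
truth = [[0, 1, 0, 0],
         [0, 1, 1, 0]]
print(mask_iou(pred, truth))  # 0.5
```

A higher value means the predicted crack region overlaps the annotated one more closely, so a relative IoU gain at similar inspection time indicates that the chosen viewpoints yielded more accurate crack masks.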

Citations: 0

An efficient ground segmentation approach for LiDAR point cloud utilizing adjacent grids

IF 3.3 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-08-11 DOI: 10.1007/s00138-024-01593-5

Longyu Dong, Dejun Liu, Youqiang Dong, Bongrae Park, Zhibo Wan

Ground segmentation is crucial for guiding mobile robots and identifying nearby objects. However, the ground often has complex topography, such as slopes and rugged terrain, which makes accurate ground segmentation considerably harder. To address this issue, we propose a novel approach for rapid ground segmentation. The proposed method uses a multi-partition approach to extract ground points for each partition, then assesses the correction plane based on the geometric characteristics of the ground surface and the similarity among adjacent planes. An adaptive threshold is also introduced to enhance efficiency in extracting complex urban pavement. Our method was benchmarked against several contemporary techniques on the SemanticKITTI dataset. Precision improved by 1.72% and precision deviation decreased by 1.02%, giving the most accurate and robust results among the evaluated methods.
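The abstract does not give the algorithm's details, so the following is only a rough sketch of the general idea of partition-wise ground extraction with an adjacent-cell consistency check. It is not the authors' method. Every name and parameter (`segment_ground`, `cell`, `max_step`, `base_tol`, `tol_per_m`) is hypothetical. The lowest point per cell stands in for a fitted plane, and a tolerance that grows with range stands in for the adaptive threshold.

```python
import math
from statistics import median


def segment_ground(points, cell=2.0, max_step=0.3, base_tol=0.15, tol_per_m=0.01):
    """Return one boolean per (x, y, z) point: True where it is labelled ground."""
    # 1. Partition: bucket point indices by their x-y grid cell.
    cells = {}
    for i, (x, y, _z) in enumerate(points):
        key = (math.floor(x / cell), math.floor(y / cell))
        cells.setdefault(key, []).append(i)

    # 2. Per-cell ground estimate: lowest z in the cell (stand-in for a plane fit).
    ground = {key: min(points[i][2] for i in idx) for key, idx in cells.items()}

    # 3. Adjacent-cell check: an estimate far from the median of its neighbours
    #    is treated as unreliable and replaced by that median.
    checked = {}
    for (cx, cy), height in ground.items():
        neighbours = [ground[(cx + dx, cy + dy)]
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      if (dx or dy) and (cx + dx, cy + dy) in ground]
        if neighbours and abs(height - median(neighbours)) > max_step:
            height = median(neighbours)
        checked[(cx, cy)] = height

    # 4. Label points using a tolerance that grows with distance from the sensor.
    labels = [False] * len(points)
    for key, idx in cells.items():
        for i in idx:
            x, y, z = points[i]
            tol = base_tol + tol_per_m * math.hypot(x, y)
            labels[i] = abs(z - checked[key]) <= tol
    return labels


# Hypothetical scene: flat ground in eight cells, one cell holding only a raised point.
points = [(x, y, 0.0) for x in (1.0, 3.0, 5.0) for y in (1.0, 3.0, 5.0)
          if (x, y) != (3.0, 3.0)] + [(3.0, 3.0, 1.5)]
print(segment_ground(points))
```

In this toy scene the centre cell contains no true ground return, so its own lowest point would be a poor estimate. Comparing against adjacent cells lets the raised point be rejected, which is the intuition behind checking similarity among neighbouring planes.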


Citations: 0

Boundary enhancement and refinement network for camouflaged object detection

IF 3.3 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Vision and Applications

Pub Date : 2024-08-03 DOI: 10.1007/s00138-024-01588-2

Chenxing Xia, Huizhen Cao, Xiuju Gao, Bin Ge, Kuan-Ching Li, Xianjin Fang, Yan Zhang, Xingzhu Liang

Camouflaged object detection aims to accurately locate and segment objects that conceal themselves well in the environment. Despite advances in deep learning methods, prevalent issues persist, including coarse boundary identification in complex scenes and ineffective integration of multi-source features. To this end, we propose a novel boundary enhancement and refinement network named BERNet, which mainly consists of three modules for enhancing and refining boundary information: an asymmetric edge module (AEM) with a multi-group dilated convolution block (GDCB), a residual mixed pooling enhanced module (RPEM), and a multivariate information interaction refiner module (M2IRM). AEM with GDCB is designed to obtain rich boundary clues, using different dilation rates to expand the receptive field. RPEM enhances boundary features under the guidance of boundary cues to improve the detection accuracy of small and multiple camouflaged objects. M2IRM progressively refines the side-out prediction maps under ground-truth supervision by fusing multi-source information. Comprehensive experiments on three benchmark datasets demonstrate that BERNet is effective compared with competitive state-of-the-art methods under most evaluation metrics.
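To illustrate why different dilation rates expand the receptive field, the sketch below applies the standard relation for a dilated convolution: a kernel of size k with dilation d spans d(k - 1) + 1 input positions along each axis. The helper names and the rates 1, 2, and 4 are illustrative assumptions. The abstract does not state which rates GDCB uses or how its groups are arranged.

```python
def effective_kernel(kernel, dilation):
    """Input span covered along one axis by a dilated convolution kernel."""
    return dilation * (kernel - 1) + 1


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied one after another.

    `layers` is an iterable of (kernel, dilation) pairs.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += effective_kernel(kernel, dilation) - 1
    return rf


# Assumed example rates for a 3x3 kernel; the parameter count stays at 9 per filter.
for rate in (1, 2, 4):
    print(rate, effective_kernel(3, rate))  # spans 3, 5, 9

print(stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```

The span grows with the dilation rate while the number of weights stays fixed, so combining several rates lets a block gather boundary context at multiple scales at little extra parameter cost.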



Citations: 0