
Pattern Recognition: Latest Publications

Jointly stochastic fully symmetric interpolatory rules and local approximation for scalable Gaussian process regression
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-31 · DOI: 10.1016/j.patcog.2024.111125
When exploring the broad application prospects of large-scale Gaussian process regression (GPR), three core challenges significantly constrain its effectiveness. First, the O(n³) time complexity of computing the inverse covariance matrix of n training points becomes an insurmountable performance bottleneck when processing large-scale datasets. Second, although traditional local approximation methods are widely used, they are often limited by the inconsistency of their prediction results. Third, many aggregation strategies lack discrimination when evaluating the importance of experts (i.e., local models), resulting in a loss of overall prediction accuracy. In response to these challenges, this article proposes a comprehensive method, TDSFSIRLA, that integrates third-degree stochastic fully symmetric interpolatory rules (TDSFSI), local approximation, and Tsallis mutual information, aiming to break through the existing limitations. Specifically, TDSFSIRLA first introduces an efficient third-degree stochastic fully symmetric interpolatory rule, which achieves an accurate approximation of the Gaussian kernel function by generating adaptive-dimensional feature maps. This innovation significantly reduces the number of required orthogonal nodes and lowers computational cost while maintaining very high approximation accuracy, providing a solid theoretical foundation for processing large-scale datasets. Furthermore, to overcome the inconsistency of local approximation methods, the paper adopts the Generalized Robust Bayesian Committee Machine (GRBCM) as the aggregation framework for local experts; its inherent consistency and robustness keep the predictions of the local models coherent and significantly improve the stability and reliability of the overall prediction. More importantly, to address the uneven distribution of expert weights, the article introduces Tsallis mutual information as the metric for weight allocation. Owing to its sensitivity to information complexity, Tsallis mutual information assigns each local expert a weight that matches its contribution, mitigating the prediction bias caused by uneven weight distribution and further improving prediction accuracy. In the experimental verification phase, comprehensive tests were conducted on multiple synthetic datasets and seven representative real datasets. The results show that the TDSFSIRLA method not only achieves a significant reduction in time complexity but also delivers excellent prediction accuracy, fully verifying its advantages and broad application prospects in the field of large-scale Gaussian process regression.
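The abstract gives no formulas for the Tsallis-mutual-information weighting, so the following is only a minimal numpy sketch of one common non-additive definition, I_q(X;Y) = S_q(X) + S_q(Y) - S_q(X,Y), used to score and weight local experts; the order q, the histogram inputs, and the softmax weighting step are assumptions, not the paper's implementation.

```python
import numpy as np

def tsallis_entropy(p, q=1.5):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def tsallis_mutual_information(joint, q=1.5):
    """One non-additive generalization, I_q(X;Y) = S_q(X) + S_q(Y) - S_q(X,Y), from a joint histogram."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return tsallis_entropy(px, q) + tsallis_entropy(py, q) - tsallis_entropy(joint.ravel(), q)

# Hypothetical usage: score each local GP expert by the Tsallis mutual information between
# its predictions and the targets (discretized into a joint histogram), then turn the scores
# into aggregation weights with a softmax.
expert_joint_histograms = [np.random.rand(8, 8) for _ in range(4)]   # toy stand-ins
scores = np.array([tsallis_mutual_information(h) for h in expert_joint_histograms])
weights = np.exp(scores) / np.exp(scores).sum()
print("expert weights:", np.round(weights, 3))
```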
{"title":"Jointly stochastic fully symmetric interpolatory rules and local approximation for scalable Gaussian process regression","authors":"","doi":"10.1016/j.patcog.2024.111125","DOIUrl":"10.1016/j.patcog.2024.111125","url":null,"abstract":"&lt;div&gt;&lt;div&gt;When exploring the broad application prospects of large-scale Gaussian process regression (GPR), three core challenges significantly constrain its full effectiveness: firstly, the &lt;span&gt;&lt;math&gt;&lt;mrow&gt;&lt;mi&gt;O&lt;/mi&gt;&lt;mrow&gt;&lt;mo&gt;(&lt;/mo&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;mo&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/math&gt;&lt;/span&gt; time complexity of computing the inverse covariance matrix of &lt;span&gt;&lt;math&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;/math&gt;&lt;/span&gt; training points becomes an insurmountable performance bottleneck when processing large-scale datasets; Secondly, although traditional local approximation methods are widely used, they are often limited by the inconsistency of prediction results; The third issue is that many aggregation strategies lack discrimination when evaluating the importance of experts (i.e. local models), resulting in a loss of overall prediction accuracy. In response to the above challenges, this article innovatively proposes a comprehensive method that integrates third-degree stochastic fully symmetric interpolatory rules (TDSFSI), local approximation, and Tsallis mutual information (TDSFSIRLA), aiming to fundamentally break through existing limitations. Specifically, TDSFSIRLA first introduces an efficient third-degree stochastic fully symmetric interpolatory rules, which achieves accurate approximation of Gaussian kernel functions by generating adaptive dimensional feature maps. This innovation not only significantly reduces the number of required orthogonal nodes and effectively lowers computational costs, but also maintains extremely high approximation accuracy, providing a solid theoretical foundation for processing large-scale datasets. Furthermore, in order to overcome the inconsistency of local approximation methods, this paper adopts the Generalized Robust Bayesian Committee Machine (GRBCM) as the aggregation framework for local experts. GRBCM ensures the harmonious unity of the prediction results of each local model through its inherent consistency and robustness, significantly improving the stability and reliability of the overall prediction. More importantly, in response to the issue of uneven distribution of expert weights, this article creatively introduces Tsallis mutual information as a metric for weight allocation. Tsallis mutual information, with its sensitive ability to capture information complexity, assigns weights to different local experts that match their contribution, effectively solving the problem of prediction bias caused by uneven weight distribution and further improving prediction accuracy. In the experimental verification phase, this article conducted comprehensive testing on multiple synthetic datasets and seven representative real datasets. 
The results show that the TDSFSIRLA method not only achieves significant reduction in time complexity, but also demonstrates excellent performance in prediction accuracy, fully verifying its significant advantages and broad application prospects in the field of large-scale Gaussi","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Apply prior feature integration to sparse object detectors
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-31 · DOI: 10.1016/j.patcog.2024.111103
Using noisy boxes as queries for sparse object detection has become a hot research topic in recent years. Sparse R-CNN achieves one-to-one prediction from noisy boxes to object boxes, while DiffusionDet transforms the prediction process of Sparse R-CNN into multiple diffusion processes. In particular, algorithms such as Sparse R-CNN and its improved versions all rely on an FPN to extract features for RoI Align, yet each target is matched to only one feature map in the FPN, which is inefficient and resource-consuming. Moreover, these sparse detectors crop regions from noisy boxes for prediction, so the boxes fail to capture global features. In this work, we rethink the sparse object detection paradigm, propose two improvements, and produce a new object detector called Prior Sparse R-CNN. First, we replace the original FPN neck with a neck that outputs only one feature map to improve efficiency. Then, we design an aggregated encoder after the neck to address the object-scale problem through dilated residual blocks and feature aggregation. Another improvement is that we introduce prior knowledge for the noisy boxes to enhance their understanding of global representations: we design a Region Generation Network (RGN) to generate global object information and fuse it with the features of the noisy boxes as prior knowledge. Prior Sparse R-CNN reaches a state-of-the-art 47.0 AP on the COCO 2017 validation set, surpassing DiffusionDet by 1.5 AP with a ResNet-50 backbone. Additionally, our training requires only 3/5 of the time.
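As a rough illustration of the dilated residual blocks mentioned for the aggregated encoder, a PyTorch sketch might look like the following; the channel count, dilation rates, and class name are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Residual block whose 3x3 convolutions use dilation to enlarge the receptive field
    without shrinking the single feature map coming from the neck."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # keeps spatial size for a 3x3 kernel
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

# Toy usage: stack blocks with growing dilation over the single-scale feature map.
feat = torch.randn(1, 256, 50, 50)
encoder = nn.Sequential(*[DilatedResidualBlock(256, d) for d in (1, 2, 4)])
print(encoder(feat).shape)  # torch.Size([1, 256, 50, 50])
```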
{"title":"Apply prior feature integration to sparse object detectors","authors":"","doi":"10.1016/j.patcog.2024.111103","DOIUrl":"10.1016/j.patcog.2024.111103","url":null,"abstract":"<div><div>Noisy boxes as queries for sparse object detection has become a hot topic of research in recent years. Sparse R-CNN achieves one-to-one prediction from noisy boxes to object boxes, while DiffusionDet transforms the prediction process of Sparse R-CNN into multiple diffusion processes. Especially, algorithms such as Sparse R-CNN and its improved versions all rely on FPN to extract features for ROI Aligning. But the target only matching one feature map in FPN, which is inefficient and resource-consuming. otherwise, these methods like sparse object detection crop regions from noisy boxes for prediction, resulting in boxes failing to capture global features. In this work, we rethink the detection paradigm of sparse object detection and propose two improvements and produce a new object detector, called Prior Sparse R-CNN. Firstly, we replace the original FPN neck with a neck that only outputs one feature map to improve efficiency. Then, we design aggregated encoder after neck to solve the object scale problem through dilated residual blocks and feature aggregation. Another improvement is that we introduce prior knowledge for noisy boxes to enhance their understanding of global representations. Region Generation network (RGN) is designed by us to generate global object information and fuse it with the features of noisy boxes as prior knowledge. Prior Sparse R-CNN reaches the state-of-the-art 47.0 AP on COCO 2017 validation set, surpassing DiffusionDet by 1.5 AP with ResNet-50 backbone. Additionally, our training epoch requires only 3/5 of the time.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-31 · DOI: 10.1016/j.patcog.2024.111106
The current successful paradigm for skeleton-based action recognition combines Graph Convolutional Networks (GCNs), which model spatial correlations, with Temporal Convolutional Networks (TCNs), which extract motion features. Such GCN-TCN approaches usually rely on local graph convolution operations, which limits their ability to capture complicated correlations among distant joints and to represent long-range dependencies. Although the self-attention mechanism originating from Transformers shows great potential for modeling correlations among global joints, Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) to simultaneously learn both local and global representations in the spatial-temporal dimension. Our approach consists of three components. The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations through parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling, constituting a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of skeleton action sequence frames, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that LG-SGNet performs very competitively compared with state-of-the-art methods. Our project website is available at https://github.com/DingYyue/LG-SGNet.
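A minimal PyTorch sketch of the general local-plus-global idea (a graph convolution over the skeleton adjacency in parallel with multi-head self-attention over all joints) is shown below; the layer name, the identity placeholder adjacency, and the simple additive fusion are assumptions, not the LG-GCN module itself.

```python
import torch
import torch.nn as nn

class LocalGlobalSpatialLayer(nn.Module):
    """Toy layer combining a local graph convolution over the skeleton adjacency with a
    global multi-head self-attention over all joints, then summing the two branches."""
    def __init__(self, dim: int, num_joints: int, heads: int = 4):
        super().__init__()
        self.local_fc = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Identity adjacency as a placeholder; a real model would use the skeleton graph.
        self.register_buffer("adj", torch.eye(num_joints))

    def forward(self, x):            # x: (batch, joints, dim)
        local = self.local_fc(torch.einsum("vw,bwd->bvd", self.adj, x))  # A X W
        glob, _ = self.attn(x, x, x)                                     # all-pairs joints
        return torch.relu(local + glob)

# Toy usage with 25 joints and 64-dimensional joint features.
layer = LocalGlobalSpatialLayer(dim=64, num_joints=25)
print(layer(torch.randn(2, 25, 64)).shape)  # torch.Size([2, 25, 64])
```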
{"title":"Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition","authors":"","doi":"10.1016/j.patcog.2024.111106","DOIUrl":"10.1016/j.patcog.2024.111106","url":null,"abstract":"<div><div>The current successful paradigm for skeleton-based action recognition is the combination of Graph Convolutional Networks (GCNs) modeling spatial correlations, and Temporal Convolution Networks (TCNs), extracting motion features. Such GCN-TCN-based approaches usually rely on local graph convolution operations, which limits their ability to capture complicated correlations among distant joints, as well as represent long-range dependencies. Although the self-attention originated from Transformers shows great potential in correlation modeling of global joints, the Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) to simultaneously learn both local and global representations in the spatial–temporal dimension. Our approach consists of three components: The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations by parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs a joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling. This constitutes a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of skeleton action sequence frames, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that our LG-SGNet performs very competitively compared to the state-of-the-art methods. Our project website is available at <span><span>https://github.com/DingYyue/LG-SGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Explainability-based knowledge distillation
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-30 · DOI: 10.1016/j.patcog.2024.111095
Knowledge distillation (KD) is a popular approach for deep model acceleration. Based on the knowledge distilled, we categorize KD methods as label-related and structure-related. The former distills the very abstract (high-level) knowledge, e.g., logits; and the latter uses the spatial (low- or medium-level feature) knowledge. However, existing KD methods are usually not explainable, i.e., we do not know what knowledge is transferred during distillation. In this work, we propose a new KD method, Explainability-based Knowledge Distillation (Exp-KD). Specifically, we propose to use class activation map (CAM) as the explainable knowledge which can effectively capture both label- and structure-related information during the distillation. We conduct extensive experiments, including image classification tasks on CIFAR-10, CIFAR-100 and ImageNet datasets, and explainability tests on ImageNet and ImageNet-Segmentation. The results show the great effectiveness and explainability of Exp-KD compared with the state-of-the-art. Code is available at https://github.com/Blenderama/Exp-KD.
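Class activation maps themselves are standard (weight the last convolutional feature maps by the classifier weights of a class and sum over channels); the distillation term below, which matches normalized student and teacher CAMs with an L2 loss, is only a hedged sketch of how such explainable knowledge could be transferred, not the Exp-KD objective.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Standard CAM: weight the last conv feature maps (B, K, H, W) by the classifier
    weights (num_classes, K) of the chosen class and sum over channels."""
    w = fc_weight[class_idx]                        # (K,)
    cam = torch.einsum("k,bkhw->bhw", w, features)  # (B, H, W)
    return F.relu(cam)

def cam_distillation_loss(student_feat, student_w, teacher_feat, teacher_w, labels):
    """Hypothetical CAM-matching term: make the student's CAM mimic the teacher's CAM
    (max-normalized, then compared with an L2 loss). Assumes matching spatial sizes."""
    losses = []
    for b, c in enumerate(labels.tolist()):
        s = class_activation_map(student_feat[b:b + 1], student_w, c)
        t = class_activation_map(teacher_feat[b:b + 1], teacher_w, c)
        s = s / (s.amax() + 1e-6)
        t = t / (t.amax() + 1e-6)
        losses.append(F.mse_loss(s, t))
    return torch.stack(losses).mean()

# Toy usage: 2 images, 10 classes, 32 student / 64 teacher channels, 7x7 feature maps.
sf, tf = torch.randn(2, 32, 7, 7), torch.randn(2, 64, 7, 7)
sw, tw = torch.randn(10, 32), torch.randn(10, 64)
print(cam_distillation_loss(sf, sw, tf, tw, torch.tensor([3, 7])))
```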
{"title":"Explainability-based knowledge distillation","authors":"","doi":"10.1016/j.patcog.2024.111095","DOIUrl":"10.1016/j.patcog.2024.111095","url":null,"abstract":"<div><div>Knowledge distillation (KD) is a popular approach for deep model acceleration. Based on the knowledge distilled, we categorize KD methods as label-related and structure-related. The former distills the very abstract (high-level) knowledge, e.g., logits; and the latter uses the spatial (low- or medium-level feature) knowledge. However, existing KD methods are usually not explainable, i.e., we do not know what knowledge is transferred during distillation. In this work, we propose a new KD method, Explainability-based Knowledge Distillation (Exp-KD). Specifically, we propose to use class activation map (CAM) as the explainable knowledge which can effectively capture both label- and structure-related information during the distillation. We conduct extensive experiments, including image classification tasks on CIFAR-10, CIFAR-100 and ImageNet datasets, and explainability tests on ImageNet and ImageNet-Segmentation. The results show the great effectiveness and explainability of Exp-KD compared with the state-of-the-art. Code is available at <span><span>https://github.com/Blenderama/Exp-KD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-task OCTA image segmentation with innovative dimension compression
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-30 · DOI: 10.1016/j.patcog.2024.111123
Optical Coherence Tomography Angiography (OCTA) plays a crucial role in the early detection and continuous monitoring of ocular diseases, which relies on accurate multi-tissue segmentation of retinal images. Existing OCTA segmentation methods typically focus on single-task designs that do not fully utilize the information of volume data in these images. To bridge this gap, our study introduces H2C-Net, a novel network architecture engineered for simultaneous and precise segmentation of various retinal structures, including capillaries, arteries, veins, and the fovea avascular zone (FAZ). At its core, H2C-Net consists of a plug-and-play Height-Channel Module (H2C) and an Enhanced U-shaped Network (GPC-Net). The H2C module cleverly converts the height information of the OCTA volume data into channel information through the Squeeze operation, realizes the lossless dimensionality reduction from 3D to 2D, and provides the "Soft layering" information by unidirectional pooling. Meanwhile, in order to guide the network to focus on channels for training, U-Net is enhanced with group normalization, channel attention mechanism, and Parametric Rectified Linear Unit (PReLU), which reduces the dependence on batch size and enhances the network's ability to extract salient features. Extensive experiments on two subsets of the publicly available OCTA-500 dataset have shown that H2C-Net outperforms existing state-of-the-art methods. It achieves average Intersection over Union (IoU) scores of 82.84 % and 88.48 %, marking improvements of 0.81 % and 1.59 %, respectively. Similarly, the average Dice scores are elevated to 90.40 % and 93.76 %, exceeding previous benchmarks by 0.42 % and 0.94 %. The proposed H2C-Net exhibits excellent performance in OCTA image segmentation, providing an efficient and accurate multi-task segmentation solution in ophthalmic diagnostics. The code is publicly available at: https://github.com/IAAI-SIT/H2C-Net.
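A toy sketch of the height-to-channel idea, which losslessly folds the depth axis of an OCTA volume into the channel axis, is given below; the cumulative-max reading of "unidirectional pooling" is our assumption, not the authors' exact operator.

```python
import torch

def height_to_channel(volume):
    """Lossless 3D-to-2D squeeze: fold the height/depth axis of an OCTA volume
    (B, C, D, H, W) into the channel axis, giving a 2D tensor (B, C*D, H, W)."""
    b, c, d, h, w = volume.shape
    return volume.reshape(b, c * d, h, w)

def unidirectional_pool(volume):
    """One possible reading of 'unidirectional pooling': a cumulative max taken along the
    depth axis only, which keeps a soft record of the layers passed so far (an assumption)."""
    return torch.cummax(volume, dim=2).values

# Toy usage on a single-channel volume with 8 depth slices.
vol = torch.rand(2, 1, 8, 64, 64)
flat = height_to_channel(unidirectional_pool(vol))
print(flat.shape)  # torch.Size([2, 8, 64, 64])
```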
{"title":"Multi-task OCTA image segmentation with innovative dimension compression","authors":"","doi":"10.1016/j.patcog.2024.111123","DOIUrl":"10.1016/j.patcog.2024.111123","url":null,"abstract":"<div><div>Optical Coherence Tomography Angiography (OCTA) plays a crucial role in the early detection and continuous monitoring of ocular diseases, which relies on accurate multi-tissue segmentation of retinal images. Existing OCTA segmentation methods typically focus on single-task designs that do not fully utilize the information of volume data in these images. To bridge this gap, our study introduces H2C-Net, a novel network architecture engineered for simultaneous and precise segmentation of various retinal structures, including capillaries, arteries, veins, and the fovea avascular zone (FAZ). At its core, H2C-Net consists of a plug-and-play Height-Channel Module (H2C) and an Enhanced U-shaped Network (GPC-Net). The H2C module cleverly converts the height information of the OCTA volume data into channel information through the Squeeze operation, realizes the lossless dimensionality reduction from 3D to 2D, and provides the \"Soft layering\" information by unidirectional pooling. Meanwhile, in order to guide the network to focus on channels for training, U-Net is enhanced with group normalization, channel attention mechanism, and Parametric Rectified Linear Unit (PReLU), which reduces the dependence on batch size and enhances the network's ability to extract salient features. Extensive experiments on two subsets of the publicly available OCTA-500 dataset have shown that H2C-Net outperforms existing state-of-the-art methods. It achieves average Intersection over Union (IoU) scores of 82.84 % and 88.48 %, marking improvements of 0.81 % and 1.59 %, respectively. Similarly, the average Dice scores are elevated to 90.40 % and 93.76 %, exceeding previous benchmarks by 0.42 % and 0.94 %. The proposed H2C-Net exhibits excellent performance in OCTA image segmentation, providing an efficient and accurate multi-task segmentation solution in ophthalmic diagnostics. The code is publicly available at: <span><span>https://github.com/IAAI-SIT/H2C-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cross-modal independent matching network for image-text retrieval
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-29 · DOI: 10.1016/j.patcog.2024.111096
Image-text retrieval serves as a bridge connecting vision and language. Mainstream cross-modal matching methods can effectively perform cross-modal interactions with high theoretical performance, but they are deficient in efficiency. Modality-independent matching methods exhibit superior efficiency but weaker performance. Achieving a balance between matching efficiency and performance therefore remains a challenge in image-text retrieval. In this paper, we propose a new Cross-modal Independent Matching Network (CIMN) for image-text retrieval. Specifically, we first use the proposed Feature Relationship Reasoning (FRR) to infer neighborhood and potential relations of modal features. Then, we introduce Graph Pooling (GP), based on graph convolutional networks, to perform modal global semantic aggregation. Finally, we introduce the Gravitation Loss (GL), which incorporates sample mass into the learning process. This loss corrects the matching relationships between and within each modality, avoiding the equal treatment of all samples inherent in the traditional triplet loss. Extensive experiments on the Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed method. It achieves a good balance between matching efficiency and performance, surpasses other modality-independent matching methods, and obtains retrieval accuracy comparable to some mainstream cross-matching methods with an order of magnitude lower inference time.
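For reference, the traditional triplet loss that the Gravitation Loss is said to improve upon, in which every triplet contributes equally, can be sketched as follows (cosine distance and the margin value are illustrative choices):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Traditional triplet loss with a fixed margin: every (anchor, positive, negative)
    triplet is weighted equally, which is the behavior the Gravitation Loss is said to
    correct by assigning samples a 'mass'."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)   # distance to the matching caption/image
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)   # distance to a non-matching one
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with 512-dimensional image/text embeddings.
a, p, n = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(triplet_loss(a, p, n))
```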
{"title":"Cross-modal independent matching network for image-text retrieval","authors":"","doi":"10.1016/j.patcog.2024.111096","DOIUrl":"10.1016/j.patcog.2024.111096","url":null,"abstract":"<div><div>Image-text retrieval serves as a bridge connecting vision and language. Mainstream modal cross matching methods can effectively perform cross-modal interactions with high theoretical performance. However, there is a deficiency in efficiency. Modal independent matching methods exhibit superior efficiency but lack in performance. Therefore, achieving a balance between matching efficiency and performance becomes a challenge in the field of image-text retrieval. In this paper, we propose a new Cross-modal Independent Matching Network (CIMN) for image-text retrieval. Specifically, we first use the proposed Feature Relationship Reasoning (FRR) to infer neighborhood and potential relations of modal features. Then, we introduce Graph Pooling (GP) based on graph convolutional networks to perform modal global semantic aggregation. Finally, we introduce the Gravitation Loss (GL) by incorporating sample mass into the learning process. This loss can correct the matching relationship between and within each modality, avoiding the problem of equal treatment of all samples in the traditional triplet loss. Extensive experiments on Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed method. It achieves a good balance between matching efficiency and performance, surpasses other similar independent matching methods in performance, and can obtain retrieval accuracy comparable to some mainstream cross matching methods with an order of magnitude lower inference time.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fully exploring object relation interaction and hidden state attention for video captioning
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-28 · DOI: 10.1016/j.patcog.2024.111138
Video Captioning (VC) is the challenging task of automatically generating natural language sentences that describe video content. Because a video often contains multiple objects, it is crucial to identify those objects and to model the relationships between them. Previous models usually adopt Graph Convolutional Networks (GCNs) to infer relational information via object nodes, but such relational reasoning suffers from uncertainty and over-smoothing. To tackle these issues, we propose a Knowledge Graph based Video Captioning Network (KG-VCN) that fully explores object relation interaction, hidden states, and attention enhancement. In the encoding stage, we present a Graph and Convolution Hybrid Encoder (GCHE), which uses an object detector to find visual objects with bounding boxes for the Knowledge Graph (KG) and a Convolutional Neural Network (CNN). To model intrinsic relations between detected objects, we propose a knowledge graph based Object Relation Graph Interaction (ORGI) module. In ORGI, we design triplets (head, relation, tail) to efficiently mine object relations and create a global node to enable adequate information flow among all graph nodes, avoiding possibly missed relations. To produce accurate and rich captions, we propose a hidden State and Attention Enhanced Decoder (SAED) that integrates hidden states with dynamically updated attention features. SAED accepts both relational and visual features, adopts a Long Short-Term Memory (LSTM) network to produce hidden states, and dynamically updates attention features. Unlike existing methods, we concatenate state and attention features to predict the next word sequentially. To demonstrate the effectiveness of our model, we conduct experiments on three well-known datasets (MSVD, MSR-VTT, and VaTeX), where our model achieves impressive results that significantly outperform existing state-of-the-art models.
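A minimal PyTorch sketch of one decoding step in the spirit of SAED, attending over encoded features and concatenating the hidden state with the attention context before predicting the next word, is shown below; the dimensions, module names, and dot-product attention form are assumptions, not the paper's exact decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAttentionDecoderStep(nn.Module):
    """One decoding step: attend over encoder features with the previous hidden state,
    run an LSTM cell, then concatenate [hidden state; attention context] for the word
    classifier, mirroring the 'concatenate state and attention features' idea."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, feat_dim)
        self.classifier = nn.Linear(hidden_dim + feat_dim, vocab_size)

    def forward(self, word_emb, enc_feats, h, c):       # enc_feats: (B, N, feat_dim)
        attn = F.softmax(torch.bmm(enc_feats, self.query(h).unsqueeze(2)), dim=1)  # (B, N, 1)
        context = (attn * enc_feats).sum(dim=1)                                    # (B, feat_dim)
        h, c = self.cell(torch.cat([word_emb, context], dim=1), (h, c))
        logits = self.classifier(torch.cat([h, context], dim=1))
        return logits, h, c

# Toy usage: 10 encoded region/relation features, 300-d word embeddings, 1000-word vocabulary.
step = StateAttentionDecoderStep(feat_dim=512, embed_dim=300, hidden_dim=512, vocab_size=1000)
h = c = torch.zeros(2, 512)
logits, h, c = step(torch.randn(2, 300), torch.randn(2, 10, 512), h, c)
print(logits.shape)  # torch.Size([2, 1000])
```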
{"title":"Fully exploring object relation interaction and hidden state attention for video captioning","authors":"","doi":"10.1016/j.patcog.2024.111138","DOIUrl":"10.1016/j.patcog.2024.111138","url":null,"abstract":"<div><div>Video Captioning (VC) is a challenging task of automatically generating natural language sentences for describing video contents. As a video often contains multiple objects, it is comprehensively crucial to identify multiple objects and model relationships between them. Previous models usually adopt Graph Convolutional Networks (GCN) to infer relational information via object nodes, but there exist uncertainty and over-smoothing issues of relational reasoning. To tackle these issues, we propose a Knowledge Graph based Video Captioning Network (KG-VCN) by fully exploring object relation interaction, hidden state and attention enhancement. In encoding stages, we present a Graph and Convolution Hybrid Encoder (GCHE), which uses an object detector to find visual objects with bounding boxes for Knowledge Graph (KG) and Convolutional Neural Network (CNN). To model intrinsic relations between detected objects, we propose a knowledge graph based Object Relation Graph Interaction (ORGI) module. In ORGI, we design triplets (<em>head, relation, tail</em>) to efficiently mine object relations, and create a global node to enable adequate information flow among all graph nodes for avoiding possibly missed relations. To produce accurate and rich captions, we propose a hidden State and Attention Enhanced Decoder (SAED) by integrating hidden states and dynamically updated attention features. Our SAED accepts both relational and visual features, adopts Long Short-Term Memory (LSTM) to produce hidden states, and dynamically update attention features. Unlike existing methods, we concatenate state and attention features to predict next word sequentially. To demonstrate the effectiveness of our model, we conduct experiments on three well-known datasets (MSVD, MSR-VTT, VaTeX), and our model achieves impressive results significantly outperforming existing state-of-the-art models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Newton interpolation network for smoke semantic segmentation
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-28 · DOI: 10.1016/j.patcog.2024.111119
Smoke exhibits large variations in visual appearance that are very adverse to visual segmentation. Furthermore, its semi-transparency often produces highly complicated mixtures of smoke and background. These factors make labelling and segmenting smoke regions very difficult. To improve the accuracy of smoke segmentation, we propose a Newton Interpolation Network (NINet) for visual smoke semantic segmentation. Instead of simply concatenating or point-wise adding multi-scale encoded feature maps for information fusion or re-use, we design a Newton Interpolation Module (NIM) to extract structured information by analyzing feature values at the same position across encoded feature maps of different scales. Features interpolated by our NIM contain long-range dependencies and semantic structures across different levels, whereas traditional fusion of multi-scale feature maps cannot model the intrinsic structures embedded in these maps. To obtain multi-scale structured information, we repeatedly use the proposed NIM at different levels of the decoding stage. In addition, we use more encoded feature maps to construct a higher-order Newton interpolation polynomial for extracting higher-order information. Extensive experiments validate that our method significantly outperforms existing state-of-the-art algorithms on virtual and real smoke datasets, and ablation experiments validate the effectiveness of our NIMs.
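The Newton interpolation that the NIM builds on is classical; the numpy sketch below computes divided-difference coefficients and evaluates the Newton-form polynomial, here applied to the value of one spatial position across three scales (the scale coordinates and values are illustrative, and the per-position application is our reading of the module, not the authors' code).

```python
import numpy as np

def divided_differences(xs, ys):
    """Newton divided-difference coefficients for nodes xs and values ys (standard formula)."""
    xs = np.asarray(xs, dtype=float)
    coef = np.array(ys, dtype=float)
    n = len(xs)
    for j in range(1, n):
        coef[j:] = (coef[j:] - coef[j - 1:-1]) / (xs[j:] - xs[:n - j])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton-form polynomial with Horner-style nesting."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

# Toy stand-in for the NIM idea: treat the value at one spatial position across three
# encoder scales as samples of a function of scale, and interpolate a value in between.
scales = np.array([1.0, 2.0, 3.0])     # e.g., feature maps at three strides
values = np.array([0.8, 0.5, 0.1])     # same position, different scales
coef = divided_differences(scales, values)
print(newton_eval(scales, coef, 1.5))  # interpolated response between the first two scales
```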
{"title":"A newton interpolation network for smoke semantic segmentation","authors":"","doi":"10.1016/j.patcog.2024.111119","DOIUrl":"10.1016/j.patcog.2024.111119","url":null,"abstract":"<div><div>Smoke has large variances of visual appearances that are very adverse to visual segmentation. Furthermore, its semi-transparency often produces highly complicated mixtures of smoke and backgrounds. These factors lead to great difficulties in labelling and segmenting smoke regions. To improve accuracy of smoke segmentation, we propose a Newton Interpolation Network (NINet) for visual smoke semantic segmentation. Unlike simply concatenating or point-wisely adding multi-scale encoded feature maps for information fusion or re-usage, we design a Newton Interpolation Module (NIM) to extract structured information by analyzing the feature values in the same position but from encoded feature maps with different scales. Interpolated features by our NIM contain long-range dependency and semantic structures across different levels, but traditional fusion of multi-scale feature maps cannot model intrinsic structures embedded in these maps. To obtain multi-scale structured information, we repeatedly use the proposed NIM at different levels of the decoding stages. In addition, we use more encoded feature maps to construct a higher order Newton interpolation polynomial for extracting higher order information. Extensive experiments validate that our method significantly outperforms existing state-of-the-art algorithms on virtual and real smoke datasets, and ablation experiments also validate the effectiveness of our NIMs.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring sample relationship for few-shot classification
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-28 · DOI: 10.1016/j.patcog.2024.111089
Few-shot classification (FSC) is a challenging problem, which aims to identify novel classes with limited samples. Most existing methods employ vanilla transfer learning or episodic meta-training to learn a feature extractor, and then measure the similarity between the query image and the few support examples of novel classes. However, these approaches merely learn feature representations from individual images, overlooking the exploration of the interrelationships among images. This neglect can hinder the attainment of more discriminative feature representations, thus limiting the potential improvement of few-shot classification performance. To address this issue, we propose a Sample Relationship Exploration (SRE) module comprising the Sample-level Attention (SA), Explicit Guidance (EG) and Channel-wise Adaptive Fusion (CAF) components, to learn discriminative category-related features. Specifically, we first employ the SA component to explore the similarity relationships among samples and obtain aggregated features of similar samples. Furthermore, to enhance the robustness of these features, we introduce the EG component to explicitly guide the learning of sample relationships by providing an ideal affinity map among samples. Finally, the CAF component is adopted to perform weighted fusion of the original features and the aggregated features, yielding category-related embeddings. The proposed method is a plug-and-play module which can be embedded into both transfer learning and meta-learning based few-shot classification frameworks. Extensive experiments on benchmark datasets show that the proposed module can effectively improve the performance over baseline models, and also perform competitively against the state-of-the-art algorithms. The source code is available at https://github.com/Chenguoz/SRE.
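A toy PyTorch sketch of the sample-level attention and channel-wise adaptive fusion described above might look like this; the dot-product similarity, the sigmoid gate, and all names are assumptions rather than the SRE module's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleRelationSketch(nn.Module):
    """Toy version of the SRE idea: aggregate features of similar samples in the batch with
    a sample-level attention map, then fuse original and aggregated features with a learned
    per-channel gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # channel-wise adaptive fusion

    def forward(self, feats):                 # feats: (B, dim)
        sim = F.softmax(feats @ feats.t() / feats.shape[1] ** 0.5, dim=-1)   # (B, B) sample attention
        aggregated = sim @ feats                                             # similar-sample aggregation
        g = torch.sigmoid(self.gate(torch.cat([feats, aggregated], dim=1)))  # (B, dim) gates
        return g * feats + (1.0 - g) * aggregated

# Toy usage on a batch of 16 samples with 128-dimensional embeddings.
module = SampleRelationSketch(128)
print(module(torch.randn(16, 128)).shape)  # torch.Size([16, 128])
```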
{"title":"Exploring sample relationship for few-shot classification","authors":"","doi":"10.1016/j.patcog.2024.111089","DOIUrl":"10.1016/j.patcog.2024.111089","url":null,"abstract":"<div><div>Few-shot classification (FSC) is a challenging problem, which aims to identify novel classes with limited samples. Most existing methods employ vanilla transfer learning or episodic meta-training to learn a feature extractor, and then measure the similarity between the query image and the few support examples of novel classes. However, these approaches merely learn feature representations from individual images, overlooking the exploration of the interrelationships among images. This neglect can hinder the attainment of more discriminative feature representations, thus limiting the potential improvement of few-shot classification performance. To address this issue, we propose a Sample Relationship Exploration (SRE) module comprising the Sample-level Attention (SA), Explicit Guidance (EG) and Channel-wise Adaptive Fusion (CAF) components, to learn discriminative category-related features. Specifically, we first employ the SA component to explore the similarity relationships among samples and obtain aggregated features of similar samples. Furthermore, to enhance the robustness of these features, we introduce the EG component to explicitly guide the learning of sample relationships by providing an ideal affinity map among samples. Finally, the CAF component is adopted to perform weighted fusion of the original features and the aggregated features, yielding category-related embeddings. The proposed method is a plug-and-play module which can be embedded into both transfer learning and meta-learning based few-shot classification frameworks. Extensive experiments on benchmark datasets show that the proposed module can effectively improve the performance over baseline models, and also perform competitively against the state-of-the-art algorithms. The source code is available at <span><span>https://github.com/Chenguoz/SRE</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SeaTrack: Rethinking Observation-Centric SORT for Robust Nearshore Multiple Object Tracking
IF 7.5 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-28 · DOI: 10.1016/j.patcog.2024.111091
Nearshore Multiple Object Tracking (NMOT) aims to locate and associate nearshore objects. Current approaches utilize Automatic Identification Systems (AIS) and radar to accomplish this task. However, video signals can describe the visual appearance of nearshore objects without prior information such as identity, location, or motion. In addition, sea clutter does not affect the capture of living objects by visual sensors. Recognizing this, we analyzed three key long-term challenges of vision-based NMOT and propose a tracking pipeline that relies solely on motion information. Maritime objects are highly susceptible to being obscured or submerged by waves, resulting in fragmented tracklets. We first introduce guiding modulation to address the long-term occlusion and interaction of maritime objects. Subsequently, we model confidence, altitude, and angular momentum to mitigate the effects of motion blur, ringing, and overshoot artifacts on observations in unstable imaging environments. Additionally, we design a motion fusion mechanism that combines long-term macro tracklets with short-term fine-grained tracklets; this correction mechanism helps reduce the estimation variance of the Kalman Filter (KF) and better handles the substantial nonlinear motion of maritime objects. We call this pipeline SeaTrack. It remains simple, online, and real-time, and demonstrates excellent performance and scalability in benchmark evaluations.
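For context, the Kalman filter that SORT-style trackers such as this one build on can be sketched as a minimal constant-velocity filter; the matrices and noise levels below are illustrative defaults, not SeaTrack's state design or tuned values.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 1D constant-velocity Kalman filter (state = [position, velocity]) of the
    kind used in SORT-style tracking pipelines."""
    def __init__(self, dt=1.0, q=1e-2, r=1e-1):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
        self.H = np.array([[1.0, 0.0]])              # only position is observed
        self.Q = q * np.eye(2)                       # process noise covariance
        self.R = np.array([[r]])                     # measurement noise covariance
        self.x = np.zeros((2, 1))
        self.P = np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        y = np.array([[z]]) - self.H @ self.x             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x

# Toy usage: track a noisy 1D position sequence.
kf = ConstantVelocityKF()
for z in [0.0, 1.1, 1.9, 3.2, 4.0]:
    kf.predict()
    kf.update(z)
print(kf.x.ravel())  # estimated [position, velocity]
```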
{"title":"SeaTrack: Rethinking Observation-Centric SORT for Robust Nearshore Multiple Object Tracking","authors":"","doi":"10.1016/j.patcog.2024.111091","DOIUrl":"10.1016/j.patcog.2024.111091","url":null,"abstract":"<div><div>Nearshore Multiple Object Tracking (NMOT) aims to locate and associate nearshore objects. Current approaches utilize Automatic Identification Systems (AIS) and radar to accomplish this task. However, video signals can describe the visual appearance of nearshore objects without prior information such as identity, location, or motion. In addition, sea clutter will not affect the capture of living objects by visual sensors. Recognizing this, we analyzed three key long-term challenges of the vision-based NMOT and proposed a tracking pipeline that relies solely on motion information. Maritime objects are highly susceptible to being obscured or submerged by waves, resulting in fragmented tracklets. We first introduced guiding modulation to address the long-term occlusion and interaction of maritime objects. Subsequently, we modeled confidence, altitude, and angular momentum to mitigate the effects of motion blur, ringing, and overshoot artifacts to observations in unstable imaging environments. Additionally, we designed a motion fusion mechanism that combines long-term macro tracklets with short-term fine-grained tracklets. This correction mechanism helps reduce the estimation variance of the Kalman Filter (KF) to alleviate the substantial nonlinear motion of maritime objects. We call this pipeline SeaTrack, which remains simple, online, and real-time, demonstrating excellent performance and scalability in benchmark evaluations.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0