ISPRS Journal of Photogrammetry and Remote Sensing最新文献_第6页

Recursive classification of satellite imaging time-series: An application to land cover mapping 卫星成像时间序列的递归分类：应用于土地覆被制图

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-27 DOI: 10.1016/j.isprsjprs.2024.09.003

Helena Calatrava , Bhavya Duvvuri , Haoqing Li , Ricardo Borsoi , Edward Beighley , Deniz Erdoğmuş , Pau Closas , Tales Imbiriba

Despite the extensive body of literature focused on remote sensing applications for land cover mapping and the availability of high-resolution satellite imagery, methods for continuously updating classification maps in real-time remain limited, especially when training data is scarce. This paper introduces the recursive Bayesian classifier (RBC), which converts any instantaneous classifier into a robust online method through a probabilistic framework that is resilient to non-informative image variations. Three experiments are conducted using Sentinel-2 data: water mapping of the Oroville Dam in California and the Charles River basin in Massachusetts, and deforestation detection in the Amazon. RBC is applied to a Gaussian mixture model (GMM), logistic regression (LR), and our proposed spectral index classifier (SIC). Results show that RBC significantly enhances classifier robustness in multitemporal settings under challenging conditions, such as cloud cover and cyanobacterial blooms. Specifically, balanced classification accuracy improves by up to 26.95% for SIC, 12.4% for GMM, and 13.81% for LR in water mapping, and by 15.25%, 14.17%, and 14.7% in deforestation detection. Moreover, without additional training data, RBC improves the performance of the state-of-the-art DeepWaterMap and WatNet algorithms by up to 9.62% and 11.03%. These benefits are provided by RBC while requiring minimal supervision and maintaining a low computational cost that remains constant for each time step regardless of the time-series length.

尽管有大量文献关注遥感技术在土地覆被制图中的应用，也有大量高分辨率卫星图像可供使用，但实时持续更新分类图的方法仍然有限，尤其是在训练数据稀缺的情况下。本文介绍了递归贝叶斯分类器（RBC），该分类器通过一个概率框架将任何瞬时分类器转换为稳健的在线方法，可抵御非信息图像变化。利用哨兵-2 数据进行了三项实验：加利福尼亚州奥罗维尔大坝和马萨诸塞州查尔斯河流域的水地图绘制，以及亚马逊森林砍伐检测。RBC 被应用于高斯混合模型 (GMM)、逻辑回归 (LR) 和我们提出的光谱指数分类器 (SIC)。结果表明，在云层覆盖和蓝藻水华等具有挑战性的条件下，RBC 能显著增强分类器在多时空环境中的鲁棒性。具体来说，在水地图绘制中，SIC 的平衡分类准确率提高了 26.95%，GMM 提高了 12.4%，LR 提高了 13.81%；在森林砍伐检测中，RBC 的平衡分类准确率分别提高了 15.25%、14.17% 和 14.7%。此外，在没有额外训练数据的情况下，RBC 将最先进的 DeepWaterMap 和 WatNet 算法的性能分别提高了 9.62% 和 11.03%。RBC 在提供这些优势的同时，只需最低限度的监督，并保持较低的计算成本，而且无论时间序列长度如何，每个时间步长都保持不变。

{"title":"Recursive classification of satellite imaging time-series: An application to land cover mapping","authors":"Helena Calatrava , Bhavya Duvvuri , Haoqing Li , Ricardo Borsoi , Edward Beighley , Deniz Erdoğmuş , Pau Closas , Tales Imbiriba","doi":"10.1016/j.isprsjprs.2024.09.003","DOIUrl":"10.1016/j.isprsjprs.2024.09.003","url":null,"abstract":"<div><div>Despite the extensive body of literature focused on remote sensing applications for land cover mapping and the availability of high-resolution satellite imagery, methods for continuously updating classification maps in real-time remain limited, especially when training data is scarce. This paper introduces the recursive Bayesian classifier (RBC), which converts any instantaneous classifier into a robust online method through a probabilistic framework that is resilient to non-informative image variations. Three experiments are conducted using Sentinel-2 data: water mapping of the Oroville Dam in California and the Charles River basin in Massachusetts, and deforestation detection in the Amazon. RBC is applied to a Gaussian mixture model (GMM), logistic regression (LR), and our proposed spectral index classifier (SIC). Results show that RBC significantly enhances classifier robustness in multitemporal settings under challenging conditions, such as cloud cover and cyanobacterial blooms. Specifically, balanced classification accuracy improves by up to 26.95% for SIC, 12.4% for GMM, and 13.81% for LR in water mapping, and by 15.25%, 14.17%, and 14.7% in deforestation detection. Moreover, without additional training data, RBC improves the performance of the state-of-the-art DeepWaterMap and WatNet algorithms by up to 9.62% and 11.03%. These benefits are provided by RBC while requiring minimal supervision and maintaining a low computational cost that remains constant for each time step regardless of the time-series length.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 447-465"},"PeriodicalIF":10.6,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Mapping the Brazilian savanna’s natural vegetation: A SAR-optical uncertainty-aware deep learning approach 绘制巴西热带草原自然植被图：合成孔径雷达-光学不确定性感知深度学习方法

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-26 DOI: 10.1016/j.isprsjprs.2024.09.019

Paulo Silva Filho , Claudio Persello , Raian V. Maretto , Renato Machado

The Brazilian savanna (Cerrado) is considered a hotspot for conservation. Despite its environmental and social importance, the biome has suffered a rapid transformation process due to human activities. Mapping and monitoring the remaining vegetation is essential to guide public policies for biodiversity conservation. However, accurately mapping the Cerrado’s vegetation is still an open challenge. Its diverse but spectrally similar physiognomies are a source of confusion for state-of-the-art (SOTA) methods. This study proposes a deep learning model to map the natural vegetation of the Cerrado at the regional to biome level, fusing Synthetic Aperture Radar (SAR) and optical data. The proposed model is designed to deal with uncertainties caused by the different resolutions of the input Sentinel-1/2 images (10 m) and the reference data, derived from Landsat images (30 m). We designed a multi-resolution label-propagation (MRLP) module that infers maps at both resolutions and uses the class scores from the 30 m output as features for the 10 m classification layer. We train the model with the proposed calibrated dual focal loss function in a 2-stage hierarchical manner. Our results reached an overall accuracy of 70.37%, representing an increase of 15.64% compared to a SOTA random forest (RF) model. Moreover, we propose an uncertainty quantification method, which has shown to be useful not only in validating the model, but also in highlighting areas of label noise in the reference. The developed codes and dataset are available on Github.

巴西热带稀树草原（Cerrado）被认为是保护的热点地区。尽管在环境和社会方面具有重要意义，但由于人类活动的影响，该生物群落已经历了快速的转变过程。绘制和监测剩余植被对于指导保护生物多样性的公共政策至关重要。然而，准确绘制 Cerrado 的植被图仍然是一个公开的挑战。植被种类繁多，但光谱相似，这给最先进的（SOTA）方法带来了困惑。本研究提出了一种深度学习模型，通过融合合成孔径雷达（SAR）和光学数据，绘制从区域到生物群落级别的塞拉多自然植被图。所提议的模型旨在处理因输入的 Sentinel-1/2 图像（10 米）和从 Landsat 图像（30 米）得出的参考数据的分辨率不同而造成的不确定性。我们设计了一个多分辨率标签传播（MRLP）模块，该模块可推导出两种分辨率的地图，并将 30 米输出的类别得分作为 10 米分类层的特征。我们采用所提出的校准双焦点损失函数，以两阶段分层方式对模型进行训练。我们的结果达到了 70.37% 的总体准确率，与 SOTA 随机森林 (RF) 模型相比提高了 15.64%。此外，我们还提出了一种不确定性量化方法，结果表明该方法不仅有助于验证模型，还能突出参考文献中存在标签噪声的区域。开发的代码和数据集可在 Github 上获取。

{"title":"Mapping the Brazilian savanna’s natural vegetation: A SAR-optical uncertainty-aware deep learning approach","authors":"Paulo Silva Filho , Claudio Persello , Raian V. Maretto , Renato Machado","doi":"10.1016/j.isprsjprs.2024.09.019","DOIUrl":"10.1016/j.isprsjprs.2024.09.019","url":null,"abstract":"<div><div>The Brazilian savanna (Cerrado) is considered a hotspot for conservation. Despite its environmental and social importance, the biome has suffered a rapid transformation process due to human activities. Mapping and monitoring the remaining vegetation is essential to guide public policies for biodiversity conservation. However, accurately mapping the Cerrado’s vegetation is still an open challenge. Its diverse but spectrally similar physiognomies are a source of confusion for state-of-the-art (SOTA) methods. This study proposes a deep learning model to map the natural vegetation of the Cerrado at the regional to biome level, fusing Synthetic Aperture Radar (SAR) and optical data. The proposed model is designed to deal with uncertainties caused by the different resolutions of the input Sentinel-1/2 images (10 m) and the reference data, derived from Landsat images (30 m). We designed a multi-resolution label-propagation (MRLP) module that infers maps at both resolutions and uses the class scores from the 30 m output as features for the 10 m classification layer. We train the model with the proposed calibrated dual focal loss function in a 2-stage hierarchical manner. Our results reached an overall accuracy of 70.37%, representing an increase of 15.64% compared to a SOTA random forest (RF) model. Moreover, we propose an uncertainty quantification method, which has shown to be useful not only in validating the model, but also in highlighting areas of label noise in the reference. The developed codes and dataset are available on <span><span>Github</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 405-421"},"PeriodicalIF":10.6,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SDCINet: A novel cross-task integration network for segmentation and detection of damaged/changed building targets with optical remote sensing imagery SDCINet：利用光学遥感图像分割和检测受损/变化建筑目标的新型跨任务集成网络

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-26 DOI: 10.1016/j.isprsjprs.2024.09.024

Haiming Zhang , Guorui Ma , Hongyang Fan , Hongyu Gong , Di Wang , Yongxian Zhang

Buildings are primary locations for human activities and key focuses in the military domain. Rapidly detecting damaged/changed buildings (DCB) and conducting detailed assessments can effectively aid urbanization monitoring, disaster response, and humanitarian assistance. Currently, the tasks of object detection (OD) and change detection (CD) for DCB are almost independent of each other, making it difficult to simultaneously determine the location and details of changes. Based on this, we have designed a cross-task network called SDCINet, which integrates OD and CD, and have created four dual-task datasets focused on disasters and urbanization. SDCINet is a novel deep learning dual-task framework composed of a consistency encoder, differentiation decoder, and cross-task global attention collaboration module (CGAC). It is capable of modeling differential feature relationships based on bi-temporal images, performing end-to-end pixel-level prediction, and object bounding box regression. The bi-direction traction function of CGAC is used to deeply couple OD and CD tasks. Additionally, we collected bi-temporal images from 10 locations worldwide that experienced earthquakes, explosions, wars, and conflicts to construct two datasets specifically for damaged building OD and CD. We also constructed two datasets for changed building OD and CD based on two publicly available CD datasets. These four datasets can serve as data benchmarks for dual-task research on DCB. Using these datasets, we conducted extensive performance evaluations of 18 state-of-the-art models from the perspectives of OD, CD, and instance segmentation. Benchmark experimental results demonstrated the superior performance of SDCINet. Ablation experiments and evaluative analyses confirmed the effectiveness and unique value of CGAC.

建筑物是人类活动的主要场所，也是军事领域的关键重点。快速检测受损/发生变化的建筑物（DCB）并进行详细评估可有效帮助城市化监测、灾难响应和人道主义援助。目前，DCB 的目标检测（OD）和变化检测（CD）任务几乎是相互独立的，因此很难同时确定变化的位置和细节。在此基础上，我们设计了一种名为 SDCINet 的跨任务网络，它集成了 OD 和 CD，并创建了四个以灾害和城市化为重点的双任务数据集。SDCINet 是一个新颖的深度学习双任务框架，由一致性编码器、差异化解码器和跨任务全局注意力协作模块（CGAC）组成。它能够基于双时相图像建立差异特征关系模型，执行端到端像素级预测和对象边界框回归。CGAC 的双向牵引功能用于深度耦合 OD 和 CD 任务。此外，我们还从全球 10 个经历过地震、爆炸、战争和冲突的地点收集了双时相图像，构建了两个数据集，专门用于受损建筑的 OD 和 CD。我们还在两个公开的 CD 数据集的基础上，构建了两个变化的建筑外景和内景数据集。这四个数据集可作为 DCB 双任务研究的数据基准。利用这些数据集，我们从 OD、CD 和实例分割的角度对 18 个最先进的模型进行了广泛的性能评估。基准实验结果证明了 SDCINet 的卓越性能。消融实验和评估分析证实了 CGAC 的有效性和独特价值。

{"title":"SDCINet: A novel cross-task integration network for segmentation and detection of damaged/changed building targets with optical remote sensing imagery","authors":"Haiming Zhang , Guorui Ma , Hongyang Fan , Hongyu Gong , Di Wang , Yongxian Zhang","doi":"10.1016/j.isprsjprs.2024.09.024","DOIUrl":"10.1016/j.isprsjprs.2024.09.024","url":null,"abstract":"<div><div>Buildings are primary locations for human activities and key focuses in the military domain. Rapidly detecting damaged/changed buildings (DCB) and conducting detailed assessments can effectively aid urbanization monitoring, disaster response, and humanitarian assistance. Currently, the tasks of object detection (OD) and change detection (CD) for DCB are almost independent of each other, making it difficult to simultaneously determine the location and details of changes. Based on this, we have designed a cross-task network called SDCINet, which integrates OD and CD, and have created four dual-task datasets focused on disasters and urbanization. SDCINet is a novel deep learning dual-task framework composed of a consistency encoder, differentiation decoder, and cross-task global attention collaboration module (CGAC). It is capable of modeling differential feature relationships based on bi-temporal images, performing end-to-end pixel-level prediction, and object bounding box regression. The bi-direction traction function of CGAC is used to deeply couple OD and CD tasks. Additionally, we collected bi-temporal images from 10 locations worldwide that experienced earthquakes, explosions, wars, and conflicts to construct two datasets specifically for damaged building OD and CD. We also constructed two datasets for changed building OD and CD based on two publicly available CD datasets. These four datasets can serve as data benchmarks for dual-task research on DCB. Using these datasets, we conducted extensive performance evaluations of 18 state-of-the-art models from the perspectives of OD, CD, and instance segmentation. Benchmark experimental results demonstrated the superior performance of SDCINet. Ablation experiments and evaluative analyses confirmed the effectiveness and unique value of CGAC.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 422-446"},"PeriodicalIF":10.6,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Reconstructing high-resolution subsurface temperature of the global ocean using deep forest with combined remote sensing and in situ observations 利用结合遥感和现场观测的深林重建全球海洋的高分辨率次表层温度

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-25 DOI: 10.1016/j.isprsjprs.2024.09.022

Hua Su , Feiyan Zhang , Jianchen Teng , An Wang , Zhanchao Huang

Estimating high-resolution ocean subsurface temperature has great importance for the refined study of ocean climate variability and change. However, the insufficient resolution and accuracy of subsurface temperature data greatly limits our comprehensive understanding of mesoscale and other fine-scale ocean processes. In this study, we integrated multiple remote sensing data and in situ observations to compare four models within two frameworks (gradient boosting and deep learning). The optimal model, Deep Forest, was selected to generate a high-resolution subsurface temperature dataset (DORS0.25°) for the upper 2000 m from 1993 to 2023. DORS0.25° exhibits excellent reconstruction accuracy, with an average R² of 0.980 and RMSE of 0.579 °C, and the monthly average accuracy is higher than IAP and ORAS5 datasets. Particularly, DORS0.25° can effectively capture detailed ocean warming characteristics in complex dynamic regions such as the Gulf Stream and the Kuroshio Extension, facilitating the study of mesoscale processes and warming within the global-scale ocean. Moreover, the research highlights that the rate of warming over the past decade has been significant, and ocean warming has consistently reached new highs since 2019. This study has demonstrated that DORS0.25° is a crucial dataset for understanding and monitoring the spatiotemporal characteristics and processes of global ocean warming, providing valuable data support for the sustainable development of the marine environment and climate change actions.

估算高分辨率海洋次表层温度对精细研究海洋气候多变性和变化具有重要意义。然而，由于次表层温度数据的分辨率和精度不足，极大地限制了我们对中尺度和其他精细尺度海洋过程的全面了解。在这项研究中，我们整合了多种遥感数据和现场观测数据，在两个框架（梯度提升和深度学习）内对四个模型进行了比较。我们选择了最优模型 "深林 "来生成 1993 年至 2023 年 2000 米上层的高分辨率次表层温度数据集（DORS0.25°）。DORS0.25° 的重建精度非常高，平均 R2 为 0.980，RMSE 为 0.579°C，月平均精度高于 IAP 和 ORAS5 数据集。特别是，DORS0.25°能有效捕捉湾流和黑潮延伸等复杂动态区域的详细海洋变暖特征，有助于研究全球尺度海洋的中尺度过程和变暖。此外，研究还强调，过去十年的变暖速度非常显著，自 2019 年以来，海洋变暖持续创下新高。这项研究表明，DORS0.25°是了解和监测全球海洋变暖时空特征和过程的重要数据集，为海洋环境可持续发展和气候变化行动提供了宝贵的数据支持。

{"title":"Reconstructing high-resolution subsurface temperature of the global ocean using deep forest with combined remote sensing and in situ observations","authors":"Hua Su , Feiyan Zhang , Jianchen Teng , An Wang , Zhanchao Huang","doi":"10.1016/j.isprsjprs.2024.09.022","DOIUrl":"10.1016/j.isprsjprs.2024.09.022","url":null,"abstract":"<div><div>Estimating high-resolution ocean subsurface temperature has great importance for the refined study of ocean climate variability and change. However, the insufficient resolution and accuracy of subsurface temperature data greatly limits our comprehensive understanding of mesoscale and other fine-scale ocean processes. In this study, we integrated multiple remote sensing data and <em>in situ</em> observations to compare four models within two frameworks (gradient boosting and deep learning). The optimal model, Deep Forest, was selected to generate a high-resolution subsurface temperature dataset (DORS0.25°) for the upper 2000 m from 1993 to 2023. DORS0.25° exhibits excellent reconstruction accuracy, with an average <em>R</em><sup>2</sup> of 0.980 and RMSE of 0.579 °C, and the monthly average accuracy is higher than IAP and ORAS5 datasets. Particularly, DORS0.25° can effectively capture detailed ocean warming characteristics in complex dynamic regions such as the Gulf Stream and the Kuroshio Extension, facilitating the study of mesoscale processes and warming within the global-scale ocean. Moreover, the research highlights that the rate of warming over the past decade has been significant, and ocean warming has consistently reached new highs since 2019. This study has demonstrated that DORS0.25° is a crucial dataset for understanding and monitoring the spatiotemporal characteristics and processes of global ocean warming, providing valuable data support for the sustainable development of the marine environment and climate change actions.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 389-404"},"PeriodicalIF":10.6,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142319168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A unified feature-motion consistency framework for robust image matching 用于稳健图像匹配的统一特征-运动一致性框架

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-25 DOI: 10.1016/j.isprsjprs.2024.09.021

Yan Zhou, Jinding Gao, Xiaoping Liu

Establishing reliable feature matches between a pair of images in various scenarios is a long-standing open problem in photogrammetry. Attention-based detector-free matching with coarse-to-fine architecture has been a typical pipeline to build matches, but the cross-attention module with global receptive field may compromise the structural local consistency by introducing irrelevant regions (outliers). Motion field can maintain structural local consistency under the assumption that matches for adjacent features should be spatially proximate. However, motion field can only estimate local displacements between consecutive images and struggle with long-range displacements estimation in large-scale variation scenarios without spatial correlation priors. Moreover, large-scale variations may also disrupt the geometric consistency with the application of mutual nearest neighbor criterion in patch-level matching, making it difficult to recover accurate matches. In this paper, we propose a unified feature-motion consistency framework for robust image matching (MOMA), to maintain structural consistency at both global and local granularity in scale-discrepancy scenarios. MOMA devises a motion consistency-guided dependency range strategy (MDR) in cross attention, aggregating highly relevant regions within the motion consensus-restricted neighborhood to favor true matchable regions. Meanwhile, a unified framework with hierarchical attention structure is established to couple local motion field with global feature correspondence. The motion field provides local consistency constraints in feature aggregation, while feature correspondence provides spatial context prior to improve motion field estimation. To alleviate geometric inconsistency caused by hard nearest neighbor criterion, we propose an adaptive neighbor search (soft) strategy to address scale discrepancy. Extensive experiments on three datasets demonstrate that our method outperforms solid baselines, with AUC improvements of 4.73/4.02/3.34 in two-view pose estimation task at thresholds of 5°/10°/20° on Megadepth test, and 5.94% increase of accuracy at threshold of 1px in homography task on HPatches datasets. Furthermore, in the downstream tasks such as 3D mapping, the 3D models reconstructed using our method on the self-collected SYSU UAV datasets exhibit significant improvement in structural completeness and detail richness, manifesting its high applicability in wide downstream tasks. The code is publicly available at https://github.com/BunnyanChou/MOMA.

在各种情况下，在一对图像之间建立可靠的特征匹配是摄影测量中一个长期存在的难题。基于注意力的无检测器粗到细结构匹配一直是建立匹配的典型方法，但具有全局感受野的交叉注意力模块可能会引入无关区域（异常值），从而影响结构的局部一致性。运动场可以保持结构上的局部一致性，前提是相邻特征的匹配应在空间上接近。然而，运动场只能估计连续图像之间的局部位移，在没有空间相关性先验的大规模变化情况下，难以估计长距离位移。此外，大尺度变化还可能破坏补丁级匹配中应用的互为近邻准则的几何一致性，从而难以恢复准确的匹配。在本文中，我们提出了用于鲁棒图像匹配的统一特征-运动一致性框架（MOMA），以在尺度差异场景中保持全局和局部粒度的结构一致性。MOMA 在交叉注意中设计了一种以运动一致性为指导的依赖范围策略（MDR），将运动共识限制邻域内的高度相关区域聚合起来，以偏向于真正的可匹配区域。同时，建立了一个具有分层注意结构的统一框架，将局部运动场与全局特征对应关系结合起来。运动场为特征聚合提供局部一致性约束，而特征对应则为改进运动场估计提供空间上下文先验。为了缓解硬性近邻标准造成的几何不一致性，我们提出了一种自适应邻域搜索（软性）策略来解决尺度差异问题。在三个数据集上进行的广泛实验表明，我们的方法优于固体基线，在 Megadepth 测试中，当阈值为 5°/10°/20° 时，双视角姿势估计任务的 AUC 提高了 4.73/4.02/3.34；在 HPatches 数据集上，当阈值为 1px 时，同构任务的准确率提高了 5.94%。此外，在三维测绘等下游任务中，使用我们的方法在自收集的 SYSU 无人机数据集上重建的三维模型在结构完整性和细节丰富度方面都有显著提高，这表明它在广泛的下游任务中具有很高的适用性。代码可在 https://github.com/BunnyanChou/MOMA 上公开获取。

{"title":"A unified feature-motion consistency framework for robust image matching","authors":"Yan Zhou, Jinding Gao, Xiaoping Liu","doi":"10.1016/j.isprsjprs.2024.09.021","DOIUrl":"10.1016/j.isprsjprs.2024.09.021","url":null,"abstract":"<div><div>Establishing reliable feature matches between a pair of images in various scenarios is a long-standing open problem in photogrammetry. Attention-based detector-free matching with coarse-to-fine architecture has been a typical pipeline to build matches, but the cross-attention module with global receptive field may compromise the structural local consistency by introducing irrelevant regions (outliers). Motion field can maintain structural local consistency under the assumption that matches for adjacent features should be spatially proximate. However, motion field can only estimate local displacements between consecutive images and struggle with long-range displacements estimation in large-scale variation scenarios without spatial correlation priors. Moreover, large-scale variations may also disrupt the geometric consistency with the application of mutual nearest neighbor criterion in patch-level matching, making it difficult to recover accurate matches. In this paper, we propose a unified feature-motion consistency framework for robust image matching (MOMA), to maintain structural consistency at both global and local granularity in scale-discrepancy scenarios. MOMA devises a motion consistency-guided dependency range strategy (MDR) in cross attention, aggregating highly relevant regions within the motion consensus-restricted neighborhood to favor true matchable regions. Meanwhile, a unified framework with hierarchical attention structure is established to couple local motion field with global feature correspondence. The motion field provides local consistency constraints in feature aggregation, while feature correspondence provides spatial context prior to improve motion field estimation. To alleviate geometric inconsistency caused by hard nearest neighbor criterion, we propose an adaptive neighbor search (soft) strategy to address scale discrepancy. Extensive experiments on three datasets demonstrate that our method outperforms solid baselines, with AUC improvements of 4.73/4.02/3.34 in two-view pose estimation task at thresholds of 5°/10°/20° on Megadepth test, and 5.94% increase of accuracy at threshold of 1px in homography task on HPatches datasets. Furthermore, in the downstream tasks such as 3D mapping, the 3D models reconstructed using our method on the self-collected SYSU UAV datasets exhibit significant improvement in structural completeness and detail richness, manifesting its high applicability in wide downstream tasks. The code is publicly available at <span><span>https://github.com/BunnyanChou/MOMA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 368-388"},"PeriodicalIF":10.6,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142319169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Small object change detection in UAV imagery via a Siamese network enhanced with temporal mutual attention and contextual features: A case study concerning solar water heaters 利用时间相互关注和上下文特征增强的连体网络检测无人机图像中的小物体变化：太阳能热水器案例研究

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-25 DOI: 10.1016/j.isprsjprs.2024.09.027

Shikang Tao, Mengyuan Yang, Min Wang, Rui Yang, Qian Shen

Small object change detection (SOCD) based on high-spatial resolution (HSR) images is of significant practical value in applications such as the investigation of illegal urban construction, but little research is currently available. This study proposes an SOCD model called TMACNet based on a multitask network architecture. The model modifies the YOLOv8 network into a Siamese network and adds structures, including a feature difference branch (FDB), temporal mutual attention layer (TMAL) and contextual attention module (CAM), to merge differential and contextual features from different phases for the accurate extraction and analysis of small objects and their changes. To verify the proposed method, an SOCD dataset called YZDS is created based on unmanned aerial vehicle (UAV) images of small-scale solar water heaters on rooftops. The experimental results show that TMACNet exhibits strong resistance to image registration errors and building height displacement and prevents error propagation from object detection to change detection originating from overlay-based change detection. TMACNet also provides an enhanced approach to small object detection from the perspective of multitemporal information fusion. In the change detection task, TMACNet exhibits notable F1 improvements exceeding 5.96% in comparison with alternative change detection methods. In the object detection task, TMACNet outperforms the single-temporal object detection models, increasing accuracy with an approximately 1–3% improvement in the AP metric while simplifying the technical process.

基于高空间分辨率（HSR）图像的小物体变化检测（SOCD）在调查城市非法建筑等应用中具有重要的实用价值，但目前这方面的研究还很少。本研究提出了一种基于多任务网络架构的 SOCD 模型 TMACNet。该模型将 YOLOv8 网络修改为连体网络，并增加了包括特征差异分支（FDB）、时间相互注意层（TMAL）和上下文注意模块（CAM）在内的结构，以合并不同阶段的差异特征和上下文特征，从而准确提取和分析小物体及其变化。为了验证所提出的方法，我们根据无人机拍摄的屋顶小型太阳能热水器图像创建了一个名为 YZDS 的 SOCD 数据集。实验结果表明，TMACNet 对图像配准误差和建筑物高度位移具有很强的抗干扰能力，并能防止误差从对象检测传播到基于叠加的变化检测。TMACNet 还从多时空信息融合的角度为小物体检测提供了一种增强方法。在变化检测任务中，与其他变化检测方法相比，TMACNet 的 F1 显著提高，超过 5.96%。在物体检测任务中，TMACNet 的表现优于单时相物体检测模型，在简化技术流程的同时提高了准确性，AP 指标提高了约 1-3%。

{"title":"Small object change detection in UAV imagery via a Siamese network enhanced with temporal mutual attention and contextual features: A case study concerning solar water heaters","authors":"Shikang Tao, Mengyuan Yang, Min Wang, Rui Yang, Qian Shen","doi":"10.1016/j.isprsjprs.2024.09.027","DOIUrl":"10.1016/j.isprsjprs.2024.09.027","url":null,"abstract":"<div><div>Small object change detection (SOCD) based on high-spatial resolution (HSR) images is of significant practical value in applications such as the investigation of illegal urban construction, but little research is currently available. This study proposes an SOCD model called TMACNet based on a multitask network architecture. The model modifies the YOLOv8 network into a Siamese network and adds structures, including a feature difference branch (FDB), temporal mutual attention layer (TMAL) and contextual attention module (CAM), to merge differential and contextual features from different phases for the accurate extraction and analysis of small objects and their changes. To verify the proposed method, an SOCD dataset called YZDS is created based on unmanned aerial vehicle (UAV) images of small-scale solar water heaters on rooftops. The experimental results show that TMACNet exhibits strong resistance to image registration errors and building height displacement and prevents error propagation from object detection to change detection originating from overlay-based change detection. TMACNet also provides an enhanced approach to small object detection from the perspective of multitemporal information fusion. In the change detection task, TMACNet exhibits notable F1 improvements exceeding 5.96% in comparison with alternative change detection methods. In the object detection task, TMACNet outperforms the single-temporal object detection models, increasing accuracy with an approximately 1–3% improvement in the AP metric while simplifying the technical process.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 352-367"},"PeriodicalIF":10.6,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142318925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Time series sUAV data reveal moderate accuracy and large uncertainties in spring phenology metric of deciduous broadleaf forest as estimated by vegetation index-based phenological models 基于植被指数的物候模型估算的落叶阔叶林春季物候指标的时间序列 sUAV 数据显示出中等精度和较大的不确定性

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-24 DOI: 10.1016/j.isprsjprs.2024.09.023

Li Pan , Xiangming Xiao , Haoming Xia , Xiaoyan Ma , Yanhua Xie , Baihong Pan , Yuanwei Qin

Accurate delineation of spring phenology (e.g., start of growing season, SOS) of deciduous forests is essential for understanding its responses to environmental changes. To date, SOS dates from analyses of satellite images and vegetation index (VI) −based phenological models have notable discrepancies but they have not been fully evaluated, primarily due to the lack of ground reference data for evaluation. This study evaluated the SOS dates of a deciduous broadleaf forest estimated by VI-based phenological models from three satellite sensors (PlanetScope, Sentinel-2A/B, and Landsat-7/8/9) by using ground reference data collected by a small unmanned aerial vehicle (sUAV). Daily sUAV imagery (0.035-meter resolution) was used to identify and generate green leaf maps. These green leaf maps were further aggregated to generate Green Leaf Fraction (GLF) maps at the spatial resolutions of PlanetScope (3-meter), Sentinel-2A/B (10-meter), and Landsat-7/8/9 (30-meter). The temporal changes of GLF differ from those of vegetation indices in spring, with the peak dates of GLF being much earlier than those of VIs. At the SOS dates estimated by VI-based phenological models in 2022 (Julian days from 105 to 111), GLF already ranges from 62% to 96%. The moderate accuracy and large uncertainties of SOS dates from VI-based phenological models arise from the limitations of vegetation indices in accurately tracking the number of green leaves and the inherent uncertainties of the mathematical models used. The results of this study clearly highlight the need for new research on spring phenology of deciduous forests.

准确划分落叶林的春季物候（如生长季节的开始，SOS）对于了解其对环境变化的反应至关重要。迄今为止，通过分析卫星图像和基于植被指数（VI）的物候模型得出的 SOS 日期存在明显差异，但尚未对其进行全面评估，主要原因是缺乏用于评估的地面参考数据。本研究利用小型无人飞行器（sUAV）收集的地面参考数据，评估了基于植被指数的物候模型通过三种卫星传感器（PlanetScope、Sentinel-2A/B 和 Landsat-7/8/9）估算的落叶阔叶林的 SOS 日期。小型无人飞行器的每日图像（0.035 米分辨率）用于识别和生成绿叶地图。这些绿叶图经过进一步汇总，生成了空间分辨率分别为 PlanetScope（3 米）、Sentinel-2A/B（10 米）和 Landsat-7/8/9（30 米）的绿叶比例（GLF）图。GLF 的时间变化与春季植被指数的时间变化不同，GLF 的峰值日期比 VIs 的峰值日期要早得多。在 2022 年基于 VI 的物候模型估计的 SOS 日期（朱利安日从 105 天到 111 天），GLF 已达到 62% 到 96%。由于植被指数在准确跟踪绿叶数量方面的局限性和所使用数学模型的固有不确定性，基于VI的物候模型估算的SOS日期的准确性一般，不确定性较大。这项研究的结果明确强调了对落叶林春季物候进行新研究的必要性。

{"title":"Time series sUAV data reveal moderate accuracy and large uncertainties in spring phenology metric of deciduous broadleaf forest as estimated by vegetation index-based phenological models","authors":"Li Pan , Xiangming Xiao , Haoming Xia , Xiaoyan Ma , Yanhua Xie , Baihong Pan , Yuanwei Qin","doi":"10.1016/j.isprsjprs.2024.09.023","DOIUrl":"10.1016/j.isprsjprs.2024.09.023","url":null,"abstract":"<div><div>Accurate delineation of spring phenology (e.g., start of growing season, SOS) of deciduous forests is essential for understanding its responses to environmental changes. To date, SOS dates from analyses of satellite images and vegetation index (VI) −based phenological models have notable discrepancies but they have not been fully evaluated, primarily due to the lack of ground reference data for evaluation. This study evaluated the SOS dates of a deciduous broadleaf forest estimated by VI-based phenological models from three satellite sensors (PlanetScope, Sentinel-2A/B, and Landsat-7/8/9) by using ground reference data collected by a small unmanned aerial vehicle (sUAV). Daily sUAV imagery (0.035-meter resolution) was used to identify and generate green leaf maps. These green leaf maps were further aggregated to generate Green Leaf Fraction (GLF) maps at the spatial resolutions of PlanetScope (3-meter), Sentinel-2A/B (10-meter), and Landsat-7/8/9 (30-meter). The temporal changes of GLF differ from those of vegetation indices in spring, with the peak dates of GLF being much earlier than those of VIs. At the SOS dates estimated by VI-based phenological models in 2022 (Julian days from 105 to 111), GLF already ranges from 62% to 96%. The moderate accuracy and large uncertainties of SOS dates from VI-based phenological models arise from the limitations of vegetation indices in accurately tracking the number of green leaves and the inherent uncertainties of the mathematical models used. The results of this study clearly highlight the need for new research on spring phenology of deciduous forests.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 339-351"},"PeriodicalIF":10.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142314145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture 预测梯度更好：利用联合嵌入式预测架构探索 SAR ATR 的自监督学习

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-23 DOI: 10.1016/j.isprsjprs.2024.09.013

Weijie Li , Wei Yang , Tianpeng Liu , Yuenan Hou , Yuxuan Li , Zhen Liu , Yongxiang Liu , Li Liu

The growing Synthetic Aperture Radar (SAR) data can build a foundation model using self-supervised learning (SSL) methods, which can achieve various SAR automatic target recognition (ATR) tasks with pretraining in large-scale unlabeled data and fine-tuning in small-labeled samples. SSL aims to construct supervision signals directly from the data, minimizing the need for expensive expert annotation and maximizing the use of the expanding data pool for a foundational model. This study investigates an effective SSL method for SAR ATR, which can pave the way for a foundation model in SAR ATR. The primary obstacles faced in SSL for SAR ATR are small targets in remote sensing and speckle noise in SAR images, corresponding to the SSL approach and signals. To overcome these challenges, we present a novel joint-embedding predictive architecture for SAR ATR (SAR-JEPA) thatleverages local masked patches to predict the multi-scale SAR gradient representations of an unseen context. The key aspect of SAR-JEPA is integrating SAR domain features to ensure high-quality self-supervised signals as target features. In addition, we employ local masks and multi-scale features to accommodate various small targets in remote sensing. By fine-tuning and evaluating our framework on three target recognition datasets (vehicle, ship, and aircraft) with four other datasets as pretraining, we demonstrate its outperformance over other SSL methods and its effectiveness as the SAR data increases. This study demonstrates the potential of SSL for the recognition of SAR targets across diverse targets, scenes, and sensors. Our codes and weights are available in https://github.com/waterdisappear/SAR-JEPA.

不断增长的合成孔径雷达（SAR）数据可以利用自监督学习（SSL）方法建立基础模型，通过在大规模无标注数据中进行预训练和在小标注样本中进行微调，实现各种 SAR 自动目标识别（ATR）任务。SSL 旨在直接从数据中构建监督信号，最大限度地减少对昂贵的专家标注的需求，并最大限度地利用不断扩大的数据池建立基础模型。本研究探讨了一种适用于 SAR ATR 的有效 SSL 方法，它可以为 SAR ATR 的基础模型铺平道路。SAR ATR SSL 面临的主要障碍是遥感中的小目标和 SAR 图像中的斑点噪声，与 SSL 方法和信号相对应。为了克服这些挑战，我们提出了一种用于 SAR ATR 的新型联合嵌入式预测架构（SAR-JEPA），该架构利用局部遮蔽斑块来预测未见环境的多尺度 SAR 梯度表示。SAR-JEPA 的关键在于整合 SAR 域特征，确保将高质量的自监督信号作为目标特征。此外，我们还采用了局部掩码和多尺度特征，以适应遥感中的各种小型目标。通过在三个目标识别数据集（车辆、船舶和飞机）上对我们的框架进行微调和评估，并将其他四个数据集作为预训练，我们证明了该框架的性能优于其他 SSL 方法，而且随着合成孔径雷达数据的增加，该框架也非常有效。这项研究证明了 SSL 在识别不同目标、场景和传感器的合成孔径雷达目标方面的潜力。我们的代码和权重见 https://github.com/waterdisappear/SAR-JEPA。

{"title":"Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture","authors":"Weijie Li , Wei Yang , Tianpeng Liu , Yuenan Hou , Yuxuan Li , Zhen Liu , Yongxiang Liu , Li Liu","doi":"10.1016/j.isprsjprs.2024.09.013","DOIUrl":"10.1016/j.isprsjprs.2024.09.013","url":null,"abstract":"<div><div>The growing Synthetic Aperture Radar (SAR) data can build a foundation model using self-supervised learning (SSL) methods, which can achieve various SAR automatic target recognition (ATR) tasks with pretraining in large-scale unlabeled data and fine-tuning in small-labeled samples. SSL aims to construct supervision signals directly from the data, minimizing the need for expensive expert annotation and maximizing the use of the expanding data pool for a foundational model. This study investigates an effective SSL method for SAR ATR, which can pave the way for a foundation model in SAR ATR. The primary obstacles faced in SSL for SAR ATR are small targets in remote sensing and speckle noise in SAR images, corresponding to the SSL approach and signals. To overcome these challenges, we present a novel joint-embedding predictive architecture for SAR ATR (SAR-JEPA) thatleverages local masked patches to predict the multi-scale SAR gradient representations of an unseen context. The key aspect of SAR-JEPA is integrating SAR domain features to ensure high-quality self-supervised signals as target features. In addition, we employ local masks and multi-scale features to accommodate various small targets in remote sensing. By fine-tuning and evaluating our framework on three target recognition datasets (vehicle, ship, and aircraft) with four other datasets as pretraining, we demonstrate its outperformance over other SSL methods and its effectiveness as the SAR data increases. This study demonstrates the potential of SSL for the recognition of SAR targets across diverse targets, scenes, and sensors. Our codes and weights are available in <span><span>https://github.com/waterdisappear/SAR-JEPA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 326-338"},"PeriodicalIF":10.6,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142312982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding 同质标记器很重要用于遥感图像理解的同质视觉标记器

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-21 DOI: 10.1016/j.isprsjprs.2024.09.009

Run Shao , Zhaoyang Zhang , Chao Tao , Yunsheng Zhang , Chengli Peng , Haifeng Li

On the basis of the transformer architecture and the pretext task of “next-token prediction”, multimodal large language models (MLLMs) are revolutionizing the paradigm in the field of remote sensing image understanding. However, the tokenizer, as one of the fundamental components of MLLMs, has long been overlooked or even misunderstood in visual tasks. A key factor contributing to the great comprehension power of large language models is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision. Analogous to words or subwords in language, we define semantically independent regions (SIRs) for vision and then propose two properties that an ideal visual tokenizer should possess: (1) homogeneity, where SIRs serve as the basic elements of vision, and (2) adaptivity, which allows for a flexible number of tokens to accommodate images of any size and tasks of any granularity. On this basis, we design a simple HOmogeneous visual tOKenizer: HOOK. HOOK consists of two modules: an object perception module (OPM) and an object vectorization module (OVM). To achieve homogeneity, the OPM splits the image into 4 × 4 pixel seeds and then uses a self-attention mechanism to identify SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptability, the OVM predefines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of the token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19, and NaSC-TG2 classification datasets for sparse tasks and the GID5 and DGLCC segmentation datasets for dense tasks. The results show that the visual tokens obtained by HOOK correspond to individual objects, thereby verifying their homogeneity. Compared with randomly initialized or pretrained Patch Embed, which required more than one hundred tokens per image, HOOK required only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in performance improvements of 2% to 10% and efficiency improvements of 1.5 to 2.8 times. The homogeneity and adaptability of the proposed approach provide new perspectives for the study of visual tokenizers. Guided by these principles, the developed HOOK has the potential to replace traditional Patch Embed. The code is available at https://github.com/GeoX-Lab/Hook.

多模态大语言模型（MLLMs）以转换器架构和 "下一个标记预测 "的前置任务为基础，正在遥感图像理解领域掀起一场范式革命。然而，作为多模态大语言模型的基本组成部分之一，标记符号生成器在视觉任务中长期被忽视甚至误解。大型语言模型之所以具有强大的理解能力，一个关键因素是自然语言标记器利用有意义的词或子词作为语言的基本元素。相比之下，以 Patch Embed 等基于补丁的方法为代表的主流视觉标记器则依赖无意义的矩形补丁作为视觉的基本元素。与语言中的单词或子单词类似，我们为视觉定义了语义独立区域（SIR），然后提出了理想的视觉标记器应具备的两个特性：（1）同质性，即 SIR 作为视觉的基本元素；（2）自适应性，即允许灵活的标记数量，以适应任何大小的图像和任何粒度的任务。在此基础上，我们设计了一个简单的 HOmogeneous 视觉识别器：HOOK。HOOK 由两个模块组成：对象感知模块（OPM）和对象矢量化模块（OVM）。为了实现同质化，OPM 将图像分割成 4 × 4 像素种子，然后使用自注意机制来识别 SIR。OVM 采用交叉注意来合并同一 SIR 中的种子。为了实现适应性，OVM 预定义了数量可变的可学习向量作为交叉注意查询，以便调整标记数量。我们在 NWPU-RESISC45、WHU-RS19 和 NaSC-TG2 分类数据集上进行了稀疏任务实验，在 GID5 和 DGLCC 分割数据集上进行了密集任务实验。结果表明，HOOK 获得的视觉标记与单个对象相对应，从而验证了它们的同质性。与随机初始化或预训练的 Patch Embed（每幅图像需要一百多个标记）相比，HOOK 在稀疏和密集任务中分别只需要 6 个和 8 个标记，性能提高了 2% 至 10%，效率提高了 1.5 至 2.8 倍。所提方法的同质性和适应性为视觉标记化器的研究提供了新的视角。在这些原则的指导下，所开发的 HOOK 有可能取代传统的 Patch Embed。代码见 https://github.com/GeoX-Lab/Hook。

{"title":"Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding","authors":"Run Shao , Zhaoyang Zhang , Chao Tao , Yunsheng Zhang , Chengli Peng , Haifeng Li","doi":"10.1016/j.isprsjprs.2024.09.009","DOIUrl":"10.1016/j.isprsjprs.2024.09.009","url":null,"abstract":"<div><p>On the basis of the transformer architecture and the pretext task of “next-token prediction”, multimodal large language models (MLLMs) are revolutionizing the paradigm in the field of remote sensing image understanding. However, the tokenizer, as one of the fundamental components of MLLMs, has long been overlooked or even misunderstood in visual tasks. A key factor contributing to the great comprehension power of large language models is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision. Analogous to words or subwords in language, we define semantically independent regions (SIRs) for vision and then propose two properties that an ideal visual tokenizer should possess: (1) homogeneity, where SIRs serve as the basic elements of vision, and (2) adaptivity, which allows for a flexible number of tokens to accommodate images of any size and tasks of any granularity. On this basis, we design a simple HOmogeneous visual tOKenizer: HOOK. HOOK consists of two modules: an object perception module (OPM) and an object vectorization module (OVM). To achieve homogeneity, the OPM splits the image into 4 × 4 pixel seeds and then uses a self-attention mechanism to identify SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptability, the OVM predefines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of the token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19, and NaSC-TG2 classification datasets for sparse tasks and the GID5 and DGLCC segmentation datasets for dense tasks. The results show that the visual tokens obtained by HOOK correspond to individual objects, thereby verifying their homogeneity. Compared with randomly initialized or pretrained Patch Embed, which required more than one hundred tokens per image, HOOK required only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in performance improvements of 2% to 10% and efficiency improvements of 1.5 to 2.8 times. The homogeneity and adaptability of the proposed approach provide new perspectives for the study of visual tokenizers. Guided by these principles, the developed HOOK has the potential to replace traditional Patch Embed. The code is available at <span><span>https://github.com/GeoX-Lab/Hook</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 294-310"},"PeriodicalIF":10.6,"publicationDate":"2024-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142270436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

VDFT: Robust feature matching of aerial and ground images using viewpoint-invariant deformable feature transformation VDFT：利用视点不变的可变形特征变换对航空和地面图像进行稳健的特征匹配

IF 10.6 1区地球科学 Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing

Pub Date : 2024-09-21 DOI: 10.1016/j.isprsjprs.2024.09.016

Bai Zhu , Yuanxin Ye , Jinkun Dai , Tao Peng , Jiwei Deng , Qing Zhu

Establishing accurate correspondences between aerial and ground images is facing immense challenges because of the drastic viewpoint, illumination, and scale variations resulting from significant differences in viewing angles, shoot timing, and imaging mechanisms. To cope with these issues, we propose an effective aerial-to-ground feature matching method, named Viewpoint-invariant Deformable Feature Transformation (VDFT), which aims to comprehensively enhance the discrimination of local features by utilizing deformable convolutional network (DCN) and seed attention mechanism. Specifically, the proposed VDFT is constructed consisting of three pivotal modules: (1) a learnable deformable feature network is established by using DCN and Depthwise Separable Convolution (DSC) to obtain dynamic receptive fields, addressing local geometric deformations caused by viewpoint variation; (2) an improved joint detection and description strategy is presented through concurrently sharing the multi-level deformable feature representation to enhance the localization accuracy and representation capabilities of feature points; and (3) a seed attention matching module is built by introducing self- and cross- seed attention mechanisms to improve the performance and efficiency for aerial-to-ground feature matching. Finally, we conduct thorough experiments to verify the matching performance of our VDFT on five challenging aerial-to-ground datasets. Extensive experimental evaluations prove that our VDFT is more resistant to perspective distortion and drastic variations in viewpoint, illumination, and scale. It exhibits satisfactory matching performance and outperforms the current state-of-the-art (SOTA) methods in terms of robustness and accuracy.

由于观察角度、拍摄时机和成像机制的显著差异导致视角、光照和比例尺的剧烈变化，因此在航空图像和地面图像之间建立精确的对应关系面临着巨大的挑战。针对这些问题，我们提出了一种有效的航拍-地面特征匹配方法，命名为视点不变可变形特征变换（VDFT），旨在利用可变形卷积网络（DCN）和种子关注机制，全面提高局部特征的判别能力。具体来说，所提出的 VDFT 由三个关键模块组成：(1）利用 DCN 和深度可分离卷积（DSC）建立可学习的可变形特征网络，获得动态感受野，解决视角变化引起的局部几何变形问题；（2）通过同时共享多级可变形特征表示，提出改进的联合检测和描述策略，提高特征点的定位精度和表示能力；（3）通过引入自种子和交叉种子注意机制，建立种子注意匹配模块，提高空地特征匹配的性能和效率。最后，我们在五个具有挑战性的空地数据集上进行了全面的实验，以验证我们的 VDFT 的匹配性能。广泛的实验评估证明，我们的 VDFT 更能抵御视角失真以及视点、光照和尺度的剧烈变化。它的匹配性能令人满意，在鲁棒性和准确性方面优于目前最先进的（SOTA）方法。

{"title":"VDFT: Robust feature matching of aerial and ground images using viewpoint-invariant deformable feature transformation","authors":"Bai Zhu , Yuanxin Ye , Jinkun Dai , Tao Peng , Jiwei Deng , Qing Zhu","doi":"10.1016/j.isprsjprs.2024.09.016","DOIUrl":"10.1016/j.isprsjprs.2024.09.016","url":null,"abstract":"<div><p>Establishing accurate correspondences between aerial and ground images is facing immense challenges because of the drastic viewpoint, illumination, and scale variations resulting from significant differences in viewing angles, shoot timing, and imaging mechanisms. To cope with these issues, we propose an effective aerial-to-ground feature matching method, named Viewpoint-invariant Deformable Feature Transformation (VDFT), which aims to comprehensively enhance the discrimination of local features by utilizing deformable convolutional network (DCN) and seed attention mechanism. Specifically, the proposed VDFT is constructed consisting of three pivotal modules: (1) a learnable deformable feature network is established by using DCN and Depthwise Separable Convolution (DSC) to obtain dynamic receptive fields, addressing local geometric deformations caused by viewpoint variation; (2) an improved joint detection and description strategy is presented through concurrently sharing the multi-level deformable feature representation to enhance the localization accuracy and representation capabilities of feature points; and (3) a seed attention matching module is built by introducing self- and cross- seed attention mechanisms to improve the performance and efficiency for aerial-to-ground feature matching. Finally, we conduct thorough experiments to verify the matching performance of our VDFT on five challenging aerial-to-ground datasets. Extensive experimental evaluations prove that our VDFT is more resistant to perspective distortion and drastic variations in viewpoint, illumination, and scale. It exhibits satisfactory matching performance and outperforms the current state-of-the-art (SOTA) methods in terms of robustness and accuracy.</p></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 311-325"},"PeriodicalIF":10.6,"publicationDate":"2024-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142270302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0