Extending regionalization algorithms to explore spatial process heterogeneity
Authors: Hao Guo, Andre Python, Yu Liu
Journal: International Journal of Geographical Information Science (Impact Factor 4.3; JCR Q1, Computer Science, Information Systems; category: Earth Sciences, Zone 1)
Publication date: 2023-10-13
Publication type: Journal Article
DOI: 10.1080/13658816.2023.2266493 (https://doi.org/10.1080/13658816.2023.2266493)
Open access: No
Citations: 0
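Abstract
In spatial regression models, spatial heterogeneity may be considered with either continuous or discrete specifications. The latter is related to the delineation of spatially connected regions with homogeneous relationships between variables (spatial regimes). Although various regionalization algorithms have been proposed and studied in the field of spatial analytics, methods to optimize spatial regimes have been largely unexplored. In this paper, we propose two new algorithms for spatial regime delineation, two-stage K-Models and Regional-K-Models. We also extend the classic Automatic Zoning Procedure (AZP) to a spatial regression context. The proposed algorithms are applied to a series of synthetic datasets and two real-world datasets. Results indicate that all three algorithms achieve superior or comparable performance to existing approaches, while the two-stage K-Models algorithm largely outperforms existing approaches on model fitting, region reconstruction and coefficient estimation. Our work enriches the spatial analytics toolbox to explore spatial heterogeneous processes.

Keywords: Regionalization; spatial heterogeneity; spatial regime; spatial regression

Acknowledgments
The authors thank members of Spatial Analysis Group, Spatio-temporal Social Sensing Lab for helpful discussion. The constructive comments from anonymous reviewers are gratefully acknowledged.

Data and codes availability statement
The data and codes that support the findings of this study are available at https://github.com/Nithouson/regreg.

Disclosure statement
The authors declare that they have no conflict of interest.

Notes
1. For example, the population in each region is required to be as similar as possible or above a predefined value (see Duque et al. (2012), Folch and Spielman (2014), Wei et al. (2021)).
2. Note that the optimization of spatial regimes differs from Openshaw (1978), where spatial units are aggregated into areas, and each area is treated as an observation in a global regression model.
3. Note that in Equation (1), $L(R)=\sum_{j=1}^{p}\sum_{1\le i_1<i_2\le n} I[u_{i_1},u_{i_2}\in R_j]\,\lVert x_{i_1}-x_{i_2}\rVert^{2}$, the number of considered unit pairs in the sum is $\sum_{j=1}^{M}\binom{|R_j|}{2}$, which is smaller if the $|R_j|$ $(j=1,\dots,M)$ are close to each other. Hence the objective function might favor solutions whose regions have similar numbers of units.
4. Throughout the paper, we describe the case of lattice data (spatial data on areal units). Our approach is also applicable to point observation data after building adjacency (with k-nearest neighbors (KNN) or Delaunay triangulation, for example).
5. This usually happens when min_obs is close to $n/p$, where $p$ is the number of regions. Given min_obs $\ll n/p$, this issue does not cause problems, as observed in our experiments.
6. If a region with inadequate units has two or more neighboring regions, we select the neighbor which minimizes the total SSR after the merge.
7. When min_obs is too large or $K$ is too small (close to $p$), the number of regions may fall below $p$, in which case the algorithm cannot produce the required number of regions by merging 'micro-clusters'. This issue can be solved by adjusting min_obs and $K$.
8. Let $n_r$ denote the number of units in the region. The OLS estimate of the coefficient vector is $\beta=(X^{T}X)^{-1}X^{T}y$, where $X$ is the $n_r\times(m+1)$ matrix of independent variables and $y$ is the $n_r$-dimensional vector of the dependent variable. Here the intercept is included in $\beta$ by adding an independent variable with constant value 1. By applying the Sherman-Morrison formula (Bartlett 1951) to update the $(X^{T}X)^{-1}$ term, the time complexity can be reduced from $O(m^{2}(n_r+m))$ to $O(m(n_r+m))$.
9. Let $\beta_{i,j}$ denote the value of coefficient $\beta_i$ in region $R_j$. In each simulation, the list $(-2,-1,0,1,2)$ is randomly shuffled twice and used as $(\beta_{1,1},\dots,\beta_{1,5})$ and $(\beta_{2,1},\dots,\beta_{2,5})$, respectively.
10. Helbich et al. (2013) also applied principal component analysis to the GWR coefficients. This step is skipped, as dimension reduction is not needed in our experiment.
11. Different values of min_obs may be used in the two stages of the K-Models algorithm. Here min_obs = 10 is used for the merge stage, while min_obs in the partition stage is the number of independent variables plus 1 throughout this paper.
12. Even considering the average SSR rather than the lowest, two-stage K-Models and AZP consistently outperform GWR-Skater and Skater-reg; Regional-K-Models is comparable to Skater-reg and superior to GWR-Skater.
13. The GWR estimation did not complete within 30 minutes on our machine.
14. Experiments on the King County house price dataset were performed on a computer with an Intel Core i5-1135G7 CPU (2.40 GHz) and 16 GB of memory.

Funding
This research was supported by grants from the National Natural Science Foundation of China (41830645, 42271426, 41971331, 82273731), Smart Guangzhou Spatio-temporal Information Cloud Platform Construction (GZIT2016-A5-147) and the National Key Research and Development Program of China (2021YFC2701905).

Notes on contributors
Hao Guo is currently a Ph.D. candidate at the Institute of Remote Sensing and Geographic Information Systems, Peking University. He received his B.S. in Geographic Information Science and a dual B.S. in Mathematics from Peking University in 2020. His research interests include spatial analytics, geo-spatial artificial intelligence and spatial optimization.
Andre Python is ZJU100 Young Professor in Statistics at the Center for Data Science, Zhejiang University, P.R. China. He received his B.S. and M.S. from the University of Fribourg, Switzerland, and his Ph.D. from the University of St Andrews, United Kingdom. He develops and applies spatial models and interpretable machine learning algorithms to better understand the mechanisms behind the observed patterns of spatial phenomena.
Yu Liu is currently the Boya Professor of GIScience at the Institute of Remote Sensing and Geographic Information Systems, Peking University. He received his B.S., M.S. and Ph.D. degrees from Peking University in 1994, 1997 and 2003, respectively. His research interests mainly focus on humanities and social sciences based on big geo-data.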
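The footnote on OLS estimation mentions using the Sherman-Morrison formula to refresh the $(X^{T}X)^{-1}$ term when a region changes. As a rough illustration of that idea only, and not the authors' implementation (their code is in the repository cited in the availability statement), the sketch below handles the case of adding one unit with covariate row $x$ and response $y$ to a region, where $(A + xx^{T})^{-1} = A^{-1} - \frac{A^{-1}x\,x^{T}A^{-1}}{1 + x^{T}A^{-1}x}$ with $A = X^{T}X$. The function names `sherman_morrison_add` and `update_ols` are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def sherman_morrison_add(a_inv, x):
    """Return (A + x x^T)^{-1} given the symmetric inverse a_inv = A^{-1}.

    Costs O(m^2) for an m-by-m matrix, versus a full re-inversion.
    Hypothetical helper for illustration only.
    """
    m = len(x)
    # v = A^{-1} x; because A^{-1} is symmetric, x^T A^{-1} equals v^T.
    v = [sum(a_inv[i][k] * x[k] for k in range(m)) for i in range(m)]
    denom = 1.0 + sum(x[i] * v[i] for i in range(m))
    return [[a_inv[i][j] - v[i] * v[j] / denom for j in range(m)]
            for i in range(m)]


def update_ols(a_inv, xty, x, y):
    """Add one unit (covariate row x, response y) to a region.

    a_inv is (X^T X)^{-1} and xty is X^T y before the addition.
    Returns the updated inverse, the updated X^T y, and the new
    OLS coefficient vector beta = (X^T X)^{-1} X^T y.
    """
    m = len(x)
    new_inv = sherman_morrison_add(a_inv, x)
    new_xty = [xty[i] + x[i] * y for i in range(m)]
    beta = [sum(new_inv[i][k] * new_xty[k] for k in range(m))
            for i in range(m)]
    return new_inv, new_xty, beta
```

For instance, a region holding the two units (x = 0, y = 1) and (x = 1, y = 3), with an intercept column of ones, has $(X^{T}X)^{-1}$ = [[1, -1], [-1, 2]] and $X^{T}y$ = [4, 3]. Calling `update_ols` with the new row [1, 2] and response 5, which lies on the same line $y = 1 + 2x$, leaves the fitted coefficients at approximately [1, 2]. Removing a unit would need the analogous formula with the sign of the rank-one term flipped, which this sketch does not cover.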
About the journal:
International Journal of Geographical Information Science provides a forum for the exchange of original ideas, approaches, methods and experiences in the rapidly growing field of geographical information science (GIScience). It is intended to interest those who research fundamental and computational issues of geographic information, as well as issues related to the design, implementation and use of geographical information for monitoring, prediction and decision making. Published research covers innovations in GIScience and novel applications of GIScience in natural resources, social systems and the built environment, as well as relevant developments in computer science, cartography, surveying, geography and engineering in both developed and developing countries.