Pub Date: 2024-06-19, DOI: 10.1007/s13253-024-00632-y
Hui-Ning Tu, Chen-Tuo Liao
Training set optimization is a crucial factor affecting the probability of success for plant breeding programs using genomic selection. Conventionally, the training set optimization is developed to maximize Pearson’s correlation between true breeding values and genomic estimated breeding values for a testing population, because it is an essential component of genetic gain in plant breeding. However, many practical breeding programs aim to identify the best genotypes for target traits in a breeding population. A modified Bayesian optimization approach is therefore developed in this study to construct training sets for tackling such an interesting problem. The proposed approach is based on Monte Carlo simulation and data cross-validation, which is shown to be competitive with the existing methods developed to achieve the maximal Pearson’s correlation. Four real genome datasets, including two rice, one wheat, and one soybean, are analyzed in this study. An R package is generated to facilitate the application of the proposed approach. Supplementary materials accompanying this paper appear online.
Title: A Modified Bayesian Optimization Approach for Determining a Training Set to Identify the Best Genotypes from a Candidate Population in Genomic Selection (Journal of Agricultural Biological and Environmental Statistics)
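The distinction drawn in the abstract above, between maximizing Pearson's correlation and identifying the best genotypes, can be illustrated with a small sketch. The function name `top_k_overlap` and the overlap measure itself are assumptions made for illustration; this is one simple way to score "did we find the best candidates" and is not necessarily the criterion used in the paper.

```python
def top_k_overlap(true_values, predicted_values, k):
    """Fraction of the true top-k candidates that also rank in the predicted top-k.

    Hypothetical illustration metric, not taken from the paper.
    """
    def top_k(values):
        # Indices of the k largest values.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    return len(top_k(true_values) & top_k(predicted_values)) / k


# Predictions that scramble the low-ranked candidates still recover the best two.
true_bv = [5.0, 4.0, 3.0, 2.0, 1.0]
predicted_bv = [5.0, 4.0, 1.0, 2.0, 3.0]
overlap = top_k_overlap(true_bv, predicted_bv, k=2)
```

A score of this kind depends only on the top of the ranking, whereas a correlation is affected by errors anywhere in the population.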
Pub Date: 2024-05-31, DOI: 10.1007/s13253-024-00628-8
Juan Francisco Mandujano Reyes, Ian P. McGahan, Ting Fung Ma, Anne E. Ballmann, Daniel P. Walsh, Jun Zhu
Statistical methods informed by partial differential equations (PDEs), in particular reaction–diffusion PDEs such as ecological diffusion equations (EDEs), have been studied and used to model spatiotemporal processes. In this paper, we consider a stochastic extension of the EDE (SEDE) and discuss its interpretation and main differences from the deterministic EDE. We then leverage a non-stationary extension of the diffusion-based Gaussian Matérn field and show that this extension has SEDE-like behavior. The elucidated connection enables us to find a finite element approximated solution for SEDEs by means of the stochastic partial differential equation (SPDE) Bayesian method. For illustration, we analyze the evolution of white-nose syndrome (WNS) in the continental USA, comparing two models: a stationary SEDE and a non-stationary pseudo-SEDE. Our results demonstrate the importance of non-stationarity in wildlife disease modeling and identify spatial explanatory variables for the non-stationarity in the WNS process. Finally, a simulation study is conducted to assess the deviance information criterion for distinguishing between the two models, as well as the identifiability of the model parameters. Supplementary materials accompanying this paper appear online.
Title: Non-stationary Extensions of the Diffusion-Based Gaussian Matérn Field for Ecological Applications (Journal of Agricultural Biological and Environmental Statistics)
Pub Date: 2024-05-11, DOI: 10.1007/s13253-024-00626-w
Alec B. M. Van Helsdingen, Tiago A. Marques, Charlotte M. Jones-Todd
A Hawkes point process describes self-exciting behaviour where event arrivals are triggered by historic events. These models are increasingly becoming a popular choice in analysing event-type data. Like all other inhomogeneous Poisson point processes, the waiting time between events in a Hawkes process is derived from an exponential distribution with mean one. However, as with many ecological and environmental data, this is an unrealistic assumption. We, therefore, extend and generalise the Hawkes process to account for potential under- or overdispersion in the waiting times between events by assuming the Weibull distribution as the foundation of the waiting times. We apply this model to the acoustic cue production times of sperm whales and show that our Weibull–Hawkes model better captures the inherent underdispersion in the interarrival times of echolocation clicks emitted by these whales.
Title: An Inhomogeneous Weibull–Hawkes Process to Model Underdispersed Acoustic Cues (Journal of Agricultural Biological and Environmental Statistics)
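The role of the Weibull distribution in the abstract above can be made concrete through the coefficient of variation (CV) of the waiting times. The sketch below is a standalone illustration, not the authors' model: it uses the standard Weibull moment formulas to show that a shape parameter of 1 recovers the exponential case (CV of 1), a shape above 1 gives underdispersed waiting times, and a shape below 1 gives overdispersed ones. The function name `weibull_cv` is an assumption.

```python
import math


def weibull_cv(shape: float) -> float:
    """Coefficient of variation (sd / mean) of a Weibull distribution.

    The CV does not depend on the scale parameter, only on the shape.
    """
    g1 = math.gamma(1.0 + 1.0 / shape)  # mean / scale
    g2 = math.gamma(1.0 + 2.0 / shape)  # second moment / scale^2
    return math.sqrt(g2 / g1 ** 2 - 1.0)


cv_exponential = weibull_cv(1.0)      # shape 1 is the exponential distribution
cv_underdispersed = weibull_cv(2.0)   # less variable than exponential
cv_overdispersed = weibull_cv(0.5)    # more variable than exponential
```

Underdispersion here means waiting times that are more regular than an exponential distribution allows, which is the pattern the abstract reports for echolocation clicks.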
Pub Date: 2024-04-29, DOI: 10.1007/s13253-024-00619-9
Ben Seiyon Lee, Murali Haran
Spatially correlated data with an excess of zeros, usually referred to as zero-inflated spatial data, arise in many disciplines. Examples include count data, for instance, abundance (or lack thereof) of animal species and disease counts, as well as semi-continuous data like observed precipitation. Spatial two-part models are a flexible class of models for such data. Fitting two-part models can be computationally expensive for large data due to high-dimensional dependent latent variables, costly matrix operations, and slow mixing Markov chains. We describe a flexible, computationally efficient approach for modeling large zero-inflated spatial data using the projection-based intrinsic conditional autoregression (PICAR) framework. We study our approach, which we call PICAR-Z, through extensive simulation studies and two environmental data sets. Our results suggest that PICAR-Z provides accurate predictions while remaining computationally efficient. An important goal of our work is to allow researchers who are not experts in computation to easily build computationally efficient extensions to zero-inflated spatial models; this also allows for a more thorough exploration of modeling choices in two-part models than was previously possible. We show that PICAR-Z is easy to implement and extend in popular probabilistic programming languages such as nimble and stan.
Title: A class of models for large zero-inflated spatial data (Journal of Agricultural Biological and Environmental Statistics)
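The two-part structure mentioned in the abstract above can be sketched for semi-continuous data such as precipitation. The example below is a deliberately simplified, non-spatial illustration and is not PICAR-Z: it omits the spatial random effects and basis projection entirely, and the choice of a lognormal distribution for the positive part, along with the names `two_part_loglik`, `p_nonzero`, `mu` and `sigma`, are assumptions.

```python
import math


def two_part_loglik(y, p_nonzero, mu, sigma):
    """Log-likelihood of a non-spatial two-part (hurdle) model.

    Part 1: each observation is nonzero with probability p_nonzero.
    Part 2: nonzero observations follow a lognormal(mu, sigma) distribution.
    """
    total = 0.0
    for obs in y:
        if obs == 0:
            # Zero observations contribute only through the occurrence part.
            total += math.log(1.0 - p_nonzero)
        else:
            # Positive observations contribute occurrence plus lognormal density.
            z = (math.log(obs) - mu) / sigma
            log_density = -math.log(obs * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
            total += math.log(p_nonzero) + log_density
    return total


# Example: three dry days and two wet days with positive rainfall amounts.
rain = [0.0, 0.0, 0.0, 1.0, 2.5]
loglik = two_part_loglik(rain, p_nonzero=0.4, mu=0.5, sigma=1.0)
```

In a spatial two-part model, both the occurrence probability and the mean of the positive part would typically depend on covariates and spatially correlated latent variables, which is where the computational cost described in the abstract arises.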
Pub Date: 2024-04-28, DOI: 10.1007/s13253-024-00624-y
Dhanushi A. Wijeyakulasuriya, Ephraim M. Hanks, Benjamin A. Shaby
Humans have recorded the arrival dates of migratory birds for millennia, searching for trends and patterns. As the first arrival among individuals in a species is the realized tail of the probability distribution of arrivals, the appropriate statistical framework with which to analyze such events is extreme value theory. Here, for the first time, we apply formal extreme value techniques to the dynamics of bird migrations. We study the annual first arrivals of Magnolia Warblers using modern tools from the statistical field of extreme value analysis. Using observations from the eBird database, we model the spatial distribution of observed Magnolia Warbler arrivals as a max-infinitely divisible process, which allows us to spatially interpolate observed annual arrivals in a probabilistically coherent way and to project arrival dynamics into the future by conditioning on climatic variables. Supplementary materials accompanying this paper appear online.
Title: Modeling First Arrival of Migratory Birds Using a Hierarchical Max-Infinitely Divisible Process (Journal of Agricultural Biological and Environmental Statistics)
Pub Date: 2024-04-01, DOI: 10.1007/s13253-024-00616-y
Xinyi Lu, Yoichiro Kanno, George P. Valentine, Matt A. Kulp, Mevin B. Hooten
Climate change impacts ecosystems variably in space and time. Landscape features may confer resistance against environmental stressors, whose intensity and frequency also depend on local weather patterns. Characterizing spatio-temporal variation in population responses to these stressors improves our understanding of what constitutes climate change refugia. We developed a Bayesian hierarchical framework that allowed us to differentiate population responses to seasonal weather patterns depending on their “sensitive” or “resilient” states. The framework inferred these sensitivity states based on latent trajectories delineating dynamic state probabilities. The latent trajectories are composed of linear initial conditions, functional regression models, and additive random effects representing ecological mechanisms such as topological buffering and effects of legacy weather conditions. Further, we developed a Bayesian regularization strategy that promoted temporal coherence in the inferred states. We demonstrated our hierarchical framework and regularization strategy using simulated examples and a case study of native brook trout (Salvelinus fontinalis) count data from the Great Smoky Mountains National Park, southeastern USA. Our study provided insights into ecological processes influencing brook trout sensitivity. Our framework can also be applied to other species and ecosystems to facilitate management and conservation.
Title: Regularized Latent Trajectory Models for Spatio-temporal Population Dynamics (Journal of Agricultural Biological and Environmental Statistics)
Pub Date: 2024-03-13, DOI: 10.1007/s13253-024-00611-3
Andrew O. Finley, Hans-Erik Andersen, Chad Babcock, Bruce D. Cook, Douglas C. Morton, Sudipto Banerjee
A two-stage hierarchical Bayesian model is developed and implemented to estimate forest biomass density and total biomass given sparsely sampled LiDAR and georeferenced forest inventory plot measurements. The model is motivated by the United States Department of Agriculture (USDA) Forest Service Forest Inventory and Analysis (FIA) objective to provide biomass estimates for the remote Tanana Inventory Unit (TIU) in interior Alaska. The proposed model yields stratum-level biomass estimates for arbitrarily sized areas. Model-based estimates are compared with the TIU FIA design-based post-stratified estimates. Model-based small area estimates (SAEs) for two experimental forests within the TIU are compared with each forest’s design-based estimates generated using a dense network of independent inventory plots. Model parameter estimates and biomass predictions are informed using FIA plot measurements, LiDAR data that are spatially aligned with a subset of the FIA plots, and complete-coverage remotely sensed data used to define land use/land cover strata and percent forest canopy cover. Results support a model-based approach to estimating forest parameters when inventory data are sparse or resources limit collection of enough data to achieve desired accuracy and precision using design-based methods. Supplementary materials accompanying this paper appear on-line.
Title: Models to Support Forest Inventory and Small Area Estimation Using Sparsely Sampled LiDAR: A Case Study Involving G-LiHT LiDAR in Tanana, Alaska (Journal of Agricultural Biological and Environmental Statistics)
Pub Date: 2024-02-08, DOI: 10.1007/s13253-024-00602-4
Arnab Hazra, Pratik Nag, Rishikesh Yadav, Ying Sun
Increasingly large and complex spatial datasets pose massive inferential challenges due to high computational and storage costs. Our study is motivated by the KAUST Competition on Large Spatial Datasets 2023, which tasked participants with estimating spatial covariance-related parameters and predicting values at testing sites, along with uncertainty estimates. We compared various statistical and deep learning approaches through cross-validation and ultimately selected the Vecchia approximation technique for model fitting. The R package GpGp lacked support for fitting zero-mean Gaussian processes and for direct uncertainty estimation, both of which the competition required, so we developed additional R functions to overcome these constraints. In addition, we implemented certain subsampling-based approximations and parametric smoothing for skewed sampling distributions of the estimators. Our team DesiBoys secured the first position in two out of four sub-competitions and the second position in the other two, validating the effectiveness of our proposed strategies. Moreover, we extended our evaluation to a large real spatial satellite-derived dataset on total precipitable water, where we compared the predictive performances of different models using multiple diagnostics.
Title: Exploring the Efficacy of Statistical and Deep Learning Methods for Large Spatial Datasets: A Case Study (Journal of Agricultural Biological and Environmental Statistics)
Pub Date: 2024-01-31, DOI: 10.1007/s13253-023-00599-2
Pierre Dutilleul, Tomoaki Imoto, Kunio Shimizu
To analyze tree growth statistically through annual ring widths measured in 2-D horizontal trunk sections, we propose two tests of significance defined under a linear-circular regression model with fixed trigonometric effects and normal random errors with a variance-covariance structure from the symmetric circulant family. The associated von Mises distribution has a preferred direction parameter. Accordingly, the first test aims to assess the presence of a preferred direction in the radial growth of a tree from the center of its trunk in a given year. Assuming there is a preferred direction of radial growth for the tree in each of two years, the second test extends the first one by assessing the equality of tree radial growth in the two preferred directions. Both tests of significance are modified F-tests with the denominator df adjusted for the presence of autocorrelation. Their validity is analyzed for two autoregressive symmetric circulant correlation structures, as a function of the number (n) of angular data and the autocorrelation parameter value. Effects of the inter-year correlation coefficient value are also studied in the two-year case. The performance of REstricted Maximum Likelihood as an estimation method is scrutinized in an extensive Monte Carlo study, and the power of the tests is analyzed when they are valid. The new testing procedures are applied with (n = 32, 64) ring widths per year for a white spruce tree over 18 years of growth until its harvest. R code is available. Conclusions and perspectives for future research are given. Supplementary materials accompanying this paper appear on-line.
{"title":"Two Tests of Significance for Preferred Direction in Tree Radial Growth Under a Linear-Circular Regression Model with Correlated Random Errors","authors":"Pierre Dutilleul, Tomoaki Imoto, Kunio Shimizu","doi":"10.1007/s13253-023-00599-2","DOIUrl":"https://doi.org/10.1007/s13253-023-00599-2","url":null,"abstract":"<p>To analyze tree growth statistically through annual ring widths measured in 2-D horizontal trunk sections, we propose two tests of significance defined under a linear-circular regression model with fixed trigonometric effects and normal random errors with a variance-covariance structure from the symmetric circulant family. The associated von Mises distribution has a preferred direction parameter. Accordingly, the first test aims to assess the presence of a preferred direction in the radial growth of a tree from the center of its trunk in a given year. Assuming there is a preferred direction of radial growth for the tree in two years, the second test extends the first one by assessing the equality of tree radial growth in the two preferred directions. Both tests of significance are modified <i>F</i>-tests with the denominator <i>df</i> adjusted for the presence of autocorrelation. Their validity is analyzed for two autoregressive symmetric circulant correlation structures, as a function of the number (<i>n</i>) of angular data and the autocorrelation parameter value. Effects of the inter-year correlation coefficient value are also studied in the two-year case. The performance of REstricted Maximum Likelihood as estimation method is scrutinized in an extensive Monte Carlo study, and the power of the tests is analyzed when valid. The new testing procedures are applied with <span>(n = 32, 64)</span> ring widths per year for a white spruce tree during 18 years of growth until its harvest. R codes are available. Conclusions and perspectives for future research are given. 
Supplementary materials accompanying this paper appear on-line.</p>","PeriodicalId":56336,"journal":{"name":"Journal of Agricultural Biological and Environmental Statistics","volume":"231 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139656375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
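As a rough illustration of the kind of model this entry describes, the sketch below fits a first-order linear-circular regression, y_i = mu + a*cos(theta_i) + b*sin(theta_i) + e_i, to ring widths measured at equally spaced angles and computes the ordinary F statistic for "no preferred direction" (a = b = 0). It is not the authors' procedure: it assumes independent errors, whereas the paper's modified F-tests adjust the denominator df for autocorrelation under a symmetric circulant covariance estimated by REML. The function name, the equal-spacing assumption, and the first-order form are all assumptions made for this example.

```python
import math


def preferred_direction_ftest(widths):
    """Ordinary (independent-errors) F-test for a first-order trigonometric effect.

    Assumes widths[i] was measured at angle 2*pi*i/n (equally spaced), so the
    columns 1, cos, sin of the design are orthogonal and least squares has a
    closed form. Returns (direction_rad, amplitude, f_stat, df_num, df_den).
    """
    n = len(widths)
    if n < 4:
        raise ValueError("need at least 4 equally spaced measurements")
    thetas = [2.0 * math.pi * i / n for i in range(n)]

    # Closed-form least-squares estimates on equally spaced angles.
    mu = sum(widths) / n
    a = 2.0 / n * sum(y * math.cos(t) for y, t in zip(widths, thetas))
    b = 2.0 / n * sum(y * math.sin(t) for y, t in zip(widths, thetas))

    # Residual sum of squares from the fitted curve.
    rss = sum(
        (y - (mu + a * math.cos(t) + b * math.sin(t))) ** 2
        for y, t in zip(widths, thetas)
    )
    # Regression sum of squares; sum(cos^2) = sum(sin^2) = n/2 on this grid.
    ss_reg = n / 2.0 * (a * a + b * b)

    df_num, df_den = 2, n - 3
    f_stat = math.inf if rss == 0.0 else (ss_reg / df_num) / (rss / df_den)

    # mu + a*cos(t) + b*sin(t) = mu + R*cos(t - phi), so growth peaks at phi.
    direction = math.atan2(b, a) % (2.0 * math.pi)
    amplitude = math.hypot(a, b)
    return direction, amplitude, f_stat, df_num, df_den


if __name__ == "__main__":
    # Hypothetical ring widths: 32 angles, peak growth near 60 degrees.
    n = 32
    demo = [
        5.0 + 2.0 * math.cos(2.0 * math.pi * i / n - math.pi / 3.0) + 0.1 * (-1) ** i
        for i in range(n)
    ]
    print(preferred_direction_ftest(demo))
```

With positively autocorrelated errors, this unadjusted statistic would typically be too liberal, which is the motivation the abstract gives for adjusting the denominator df.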
Pub Date: 2024-01-30 DOI: 10.1007/s13253-024-00600-6
Paul B. May, Andrew O. Finley, Ralph O. Dubayah
The Global Ecosystem Dynamics Investigation (GEDI) is a spaceborne lidar instrument that collects near-global measurements of forest structure. While expansive in scope, GEDI samples are spatially sparse and cover a small fraction of the land surface. Converting the sparse samples into spatially complete predictive maps is of practical importance for a number of ecological studies. A complicating factor is that GEDI collects measurements over forested and non-forested land alike, with no automatic labeling of the land type. Such classification is important, as it categorically influences the probability distribution of the spatial process and the ecological interpretation of the observations/predictions. We propose and implement a spatial mixture model, separating the observations and the greater spatial domain into two latent classes. The latent classes are governed by a Bernoulli spatial process, with spatial effects driven by a Gaussian process. Within each class, the process is governed by a separate spatial model, describing the unique probabilistic attributes. Model predictions take the form of scalar predictions of the GEDI observables as well as discrete labeling of the class membership. Inference is conducted through a Bayesian paradigm, yielding rich quantification of prediction and uncertainty through posterior predictive distributions. We demonstrate the method using GEDI data over Wollemi National Park, Australia, using optical data from Landsat 8 as model covariates. When compared to a single spatial model, the mixture model achieves much higher posterior predictive densities on the true value. When compared to a random forest model, a common algorithmic approach in the remote sensing community, the random forest achieves better absolute prediction accuracy for prediction locations far from observed training data locations, but at the expense of location-specific assessments of uncertainty. 
The unsupervised binary classifications of the mixture model appear broadly ecologically interpretable as forest and non-forest when compared to optical imagery, but further comparison to ground-truth data is required.
{"title":"A Spatial Mixture Model for Spaceborne Lidar Observations Over Mixed Forest and Non-forest Land Types","authors":"Paul B. May, Andrew O. Finley, Ralph O. Dubayah","doi":"10.1007/s13253-024-00600-6","DOIUrl":"https://doi.org/10.1007/s13253-024-00600-6","url":null,"abstract":"<p>The Global Ecosystem Dynamics Investigation (GEDI) is a spaceborne lidar instrument that collects near-global measurements of forest structure. While expansive in scope, GEDI samples are spatially sparse and cover a small fraction of the land surface. Converting the sparse samples into spatially complete predictive maps is of practical importance for a number of ecological studies. A complicating factor is that GEDI collects measurements over forested and non-forested land alike, with no automatic labeling of the land type. Such classification is important, as it categorically influences the probability distribution of the spatial process and the ecological interpretation of the observations/predictions. We propose and implement a spatial mixture model, separating the observations and the greater spatial domain into two latent classes. The latent classes are governed by a Bernoulli spatial process, with spatial effects driven by a Gaussian process. Within each class, the process is governed by a separate spatial model, describing the unique probabilistic attributes. Model predictions take the form of scalar predictions of the GEDI observables as well as discrete labeling of the class membership. Inference is conducted through a Bayesian paradigm, yielding rich quantification of prediction and uncertainty through posterior predictive distributions. We demonstrate the method using GEDI data over Wollemi National Park, Australia, using optical data from Landsat 8 as model covariates. When compared to a single spatial model, the mixture model achieves much higher posterior predictive densities on the true value. 
When compared to a random forest model, a common algorithmic approach in the remote sensing community, the random forest achieves better absolute prediction accuracy for prediction locations far from observed training data locations, but at the expense of location-specific assessments of uncertainty. The unsupervised binary classifications of the mixture model appear broadly ecologically interpretable as forest and non-forest when compared to optical imagery, but further comparison to ground-truth data is required.</p>","PeriodicalId":56336,"journal":{"name":"Journal of Agricultural Biological and Environmental Statistics","volume":"35 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139649379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
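The generative structure summarized in this entry (a latent Gaussian process drives a Bernoulli class label, and each class has its own spatial model) can be sketched as a small simulation. This is an illustration only, not the authors' model or their Bayesian inference: it uses a 1-D transect instead of a 2-D domain, a logistic link (the abstract does not state the link), AR(1) paths as Gaussian processes with exponential correlation, and made-up parameter values. All function and variable names are hypothetical.

```python
import math
import random


def ar1_path(n, rho, sd, rng):
    """Stationary AR(1) path of length n.

    On a regular 1-D grid this is a zero-mean Gaussian process with
    correlation rho**|i - j| and marginal standard deviation sd.
    """
    x = [rng.gauss(0.0, sd)]
    innov_sd = sd * math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, innov_sd))
    return x


def simulate_mixture(n=200, seed=0):
    """Simulate class labels and observations along a transect.

    Returns (z, y, p_forest): z[i] is 1 for the 'forest' class and 0 otherwise,
    y[i] is the simulated observable, p_forest[i] is the class probability.
    """
    rng = random.Random(seed)

    # Latent Gaussian process controlling where each class is likely.
    w = ar1_path(n, rho=0.95, sd=2.0, rng=rng)
    p_forest = [1.0 / (1.0 + math.exp(-wi)) for wi in w]

    # Bernoulli spatial process: class membership given the latent surface.
    z = [1 if rng.random() < p else 0 for p in p_forest]

    # Separate spatial model within each class (hypothetical means and scales).
    forest_effect = ar1_path(n, rho=0.8, sd=5.0, rng=rng)
    open_effect = ar1_path(n, rho=0.5, sd=0.3, rng=rng)
    y = [
        25.0 + f if zi == 1 else 1.0 + o
        for zi, f, o in zip(z, forest_effect, open_effect)
    ]
    return z, y, p_forest


if __name__ == "__main__":
    z, y, p = simulate_mixture(n=10, seed=42)
    for zi, yi, pi in zip(z, y, p):
        print(zi, round(yi, 2), round(pi, 3))
```

In the paper the class labels are unobserved and are inferred jointly with the spatial effects; the simulation only shows the forward direction, from latent surface to labels to observations.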