Several concepts borrowed from graph theory are routinely used to better understand the inner workings of the (human) brain. To this end, a connectivity network of the brain is built first, which then allows one to assess quantities such as information flow and information routing via shortest path and maximum flow computations. Since brain networks typically contain several thousand nodes and edges, computational scaling is a key research area. In this contribution, we focus on approximate maximum flow computations in large brain networks. By combining graph partitioning with maximum flow computations, we propose a new approximation algorithm for the computation of the maximum flow with runtime O(|V||E|^2/k^2) compared to the usual runtime of O(|V||E|^2) for the Edmonds-Karp algorithm, where $V$ is the set of vertices, $E$ is the set of edges, and $k$ is the number of partitions. We assess both accuracy and runtime of the proposed algorithm on simulated graphs as well as on graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com).
{"title":"An efficient heuristic for approximate maximum flow computations","authors":"Jingyun Qian, Georg Hahn","doi":"arxiv-2409.08350","DOIUrl":"https://doi.org/arxiv-2409.08350","url":null,"abstract":"Several concepts borrowed from graph theory are routinely used to better\u0000understand the inner workings of the (human) brain. To this end, a connectivity\u0000network of the brain is built first, which then allows one to assess quantities\u0000such as information flow and information routing via shortest path and maximum\u0000flow computations. Since brain networks typically contain several thousand\u0000nodes and edges, computational scaling is a key research area. In this\u0000contribution, we focus on approximate maximum flow computations in large brain\u0000networks. By combining graph partitioning with maximum flow computations, we\u0000propose a new approximation algorithm for the computation of the maximum flow\u0000with runtime O(|V||E|^2/k^2) compared to the usual runtime of O(|V||E|^2) for\u0000the Edmonds-Karp algorithm, where $V$ is the set of vertices, $E$ is the set of\u0000edges, and $k$ is the number of partitions. We assess both accuracy and runtime\u0000of the proposed algorithm on simulated graphs as well as on graphs downloaded\u0000from the Brain Networks Data Repository (https://networkrepository.com).","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work aims to tackle the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by measurement error or contamination. We introduce a calibration approach tailored for the linear log-contrast model. Under relatively lenient conditions regarding the sparsity level of the parameter, we have established the asymptotic normality of the estimator for inference. Numerical experiments and an application to a microbiome study have demonstrated the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability for a wide range of research contexts.
{"title":"Debiased high-dimensional regression calibration for errors-in-variables log-contrast models","authors":"Huali Zhao, Tianying Wang","doi":"arxiv-2409.07568","DOIUrl":"https://doi.org/arxiv-2409.07568","url":null,"abstract":"Motivated by the challenges in analyzing gut microbiome and metagenomic data,\u0000this work aims to tackle the issue of measurement errors in high-dimensional\u0000regression models that involve compositional covariates. This paper marks a\u0000pioneering effort in conducting statistical inference on high-dimensional\u0000compositional data affected by mismeasured or contaminated data. We introduce a\u0000calibration approach tailored for the linear log-contrast model. Under\u0000relatively lenient conditions regarding the sparsity level of the parameter, we\u0000have established the asymptotic normality of the estimator for inference.\u0000Numerical experiments and an application in microbiome study have demonstrated\u0000the efficacy of our high-dimensional calibration strategy in minimizing bias\u0000and achieving the expected coverage rates for confidence intervals. Moreover,\u0000the potential application of our proposed methodology extends well beyond\u0000compositional data, suggesting its adaptability for a wide range of research\u0000contexts.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimation in GARMA models has traditionally been carried out under the frequentist approach. To date, Bayesian approaches for such estimation have been relatively limited. In the context of GARMA models for count time series, Bayesian estimation achieves satisfactory results in terms of point estimation. Model selection in this context often relies on information criteria. Despite their prominence in the literature, information criteria for model selection in GARMA models for count time series have been shown to perform poorly in simulations, especially in their ability to correctly identify models, even under large sample sizes. In this work, we study the problem of order selection in GARMA models for count time series, adopting a Bayesian perspective through the application of the Reversible Jump Markov Chain Monte Carlo approach. Monte Carlo simulation studies are conducted to assess the finite sample performance of the developed ideas, including point and interval inference, sensitivity analysis, effects of burn-in and thinning, as well as the choice of related priors and hyperparameters. Two real-data applications are presented, one considering automobile production in Brazil and the other considering bus exportation in Brazil before and after the COVID-19 pandemic, showcasing the method's capabilities and further exploring its flexibility.
{"title":"Order selection in GARMA models for count time series: a Bayesian perspective","authors":"Katerine Zuniga Lastra, Guilherme Pumi, Taiane Schaedler Prass","doi":"arxiv-2409.07263","DOIUrl":"https://doi.org/arxiv-2409.07263","url":null,"abstract":"Estimation in GARMA models has traditionally been carried out under the\u0000frequentist approach. To date, Bayesian approaches for such estimation have\u0000been relatively limited. In the context of GARMA models for count time series,\u0000Bayesian estimation achieves satisfactory results in terms of point estimation.\u0000Model selection in this context often relies on the use of information\u0000criteria. Despite its prominence in the literature, the use of information\u0000criteria for model selection in GARMA models for count time series have been\u0000shown to present poor performance in simulations, especially in terms of their\u0000ability to correctly identify models, even under large sample sizes. In this\u0000study, we study the problem of order selection in GARMA models for count time\u0000series, adopting a Bayesian perspective through the application of the\u0000Reversible Jump Markov Chain Monte Carlo approach. Monte Carlo simulation\u0000studies are conducted to assess the finite sample performance of the developed\u0000ideas, including point and interval inference, sensitivity analysis, effects of\u0000burn-in and thinning, as well as the choice of related priors and\u0000hyperparameters. 
Two real-data applications are presented, one considering\u0000automobile production in Brazil and the other considering bus exportation in\u0000Brazil before and after the COVID-19 pandemic, showcasing the method's\u0000capabilities and further exploring its flexibility.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial prediction problems often use Gaussian process models, which can be computationally burdensome in high dimensions. Specification of an appropriate covariance function for the model can be challenging when complex non-stationarities exist. Recent work has shown that pre-computed spatial basis functions and a feed-forward neural network can capture complex spatial dependence structures while remaining computationally efficient. This paper builds on this literature by tailoring spatial basis functions for use in convolutional neural networks. Through both simulated and real data, we demonstrate that this approach yields more accurate spatial predictions than existing methods. Uncertainty quantification is also considered.
{"title":"Spatial Deep Convolutional Neural Networks","authors":"Qi Wang, Paul A. Parker, Robert B. Lund","doi":"arxiv-2409.07559","DOIUrl":"https://doi.org/arxiv-2409.07559","url":null,"abstract":"Spatial prediction problems often use Gaussian process models, which can be\u0000computationally burdensome in high dimensions. Specification of an appropriate\u0000covariance function for the model can be challenging when complex\u0000non-stationarities exist. Recent work has shown that pre-computed spatial basis\u0000functions and a feed-forward neural network can capture complex spatial\u0000dependence structures while remaining computationally efficient. This paper\u0000builds on this literature by tailoring spatial basis functions for use in\u0000convolutional neural networks. Through both simulated and real data, we\u0000demonstrate that this approach yields more accurate spatial predictions than\u0000existing methods. Uncertainty quantification is also considered.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a new data assimilation (DA) scheme based on a sequential Markov Chain Monte Carlo (SMCMC) DA technique [Ruzayqat et al. 2024] which is provably convergent and has recently been used for filtering, particularly for high-dimensional, non-linear and potentially non-Gaussian state-space models. Unlike particle filters, which can be considered exact methods and can be used for filtering non-linear, non-Gaussian models, SMCMC does not assign weights to the samples/particles, and therefore the method does not suffer from weight degeneracy when a relatively small number of samples is used. We design a localization approach within the SMCMC framework that focuses on regions where observations are located and restricts the transition densities included in the filtering distribution of the state to these regions. This immensely reduces the effective degrees of freedom and thus improves efficiency. We test the new technique on high-dimensional ($d \sim 10^4 - 10^5$) linear Gaussian and non-linear shallow water models with Gaussian noise, using both real and synthetic observations. For two of the numerical examples, the observations mimic the data generated by the Surface Water and Ocean Topography (SWOT) mission led by NASA, a swath of ocean height observations whose location changes at every assimilation time step. We also use a set of real ocean drifter observations, in which the drifters move according to the ocean kinematics and are assumed to have uncertain locations at the time of assimilation. We show that when higher accuracy is required, the proposed algorithm is superior in terms of efficiency and accuracy over competing ensemble methods and the original SMCMC filter.
{"title":"Local Sequential MCMC for Data Assimilation with Applications in Geoscience","authors":"Hamza Ruzayqat, Omar Knio","doi":"arxiv-2409.07111","DOIUrl":"https://doi.org/arxiv-2409.07111","url":null,"abstract":"This paper presents a new data assimilation (DA) scheme based on a sequential\u0000Markov Chain Monte Carlo (SMCMC) DA technique [Ruzayqat et al. 2024] which is\u0000provably convergent and has been recently used for filtering, particularly for\u0000high-dimensional non-linear, and potentially, non-Gaussian state-space models.\u0000Unlike particle filters, which can be considered exact methods and can be used\u0000for filtering non-linear, non-Gaussian models, SMCMC does not assign weights to\u0000the samples/particles, and therefore, the method does not suffer from the issue\u0000of weight-degeneracy when a relatively small number of samples is used. We\u0000design a localization approach within the SMCMC framework that focuses on\u0000regions where observations are located and restricts the transition densities\u0000included in the filtering distribution of the state to these regions. This\u0000results in immensely reducing the effective degrees of freedom and thus\u0000improving the efficiency. We test the new technique on high-dimensional ($d\u0000sim 10^4 - 10^5$) linear Gaussian model and non-linear shallow water models\u0000with Gaussian noise with real and synthetic observations. For two of the\u0000numerical examples, the observations mimic the data generated by the Surface\u0000Water and Ocean Topography (SWOT) mission led by NASA, which is a swath of\u0000ocean height observations that changes location at every assimilation time\u0000step. We also use a set of ocean drifters' real observations in which the\u0000drifters are moving according the ocean kinematics and assumed to have\u0000uncertain locations at the time of assimilation. 
We show that when higher\u0000accuracy is required, the proposed algorithm is superior in terms of efficiency\u0000and accuracy over competing ensemble methods and the original SMCMC filter.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Factor analysis has been extensively used to reveal the dependence structures among multivariate variables, offering valuable insight in various fields. However, it cannot incorporate the spatial heterogeneity that is typically present in spatial data. To address this issue, we introduce an effective method specifically designed to discover the potential dependence structures in multivariate spatial data. Our approach assumes that spatial locations can be approximately divided into a finite number of clusters, with locations within the same cluster sharing similar dependence structures. By leveraging an iterative algorithm that combines spatial clustering with factor analysis, we simultaneously detect spatial clusters and estimate a unique factor model for each cluster. The proposed method is evaluated through comprehensive simulation studies, demonstrating its flexibility. In addition, we apply the proposed method to a dataset of railway station attributes in the Tokyo metropolitan area, highlighting its practical applicability and effectiveness in uncovering complex spatial dependencies.
{"title":"Clustered Factor Analysis for Multivariate Spatial Data","authors":"Yanxiu Jin, Tomoya Wakayama, Renhe Jiang, Shonosuke Sugasawa","doi":"arxiv-2409.07018","DOIUrl":"https://doi.org/arxiv-2409.07018","url":null,"abstract":"Factor analysis has been extensively used to reveal the dependence structures\u0000among multivariate variables, offering valuable insight in various fields.\u0000However, it cannot incorporate the spatial heterogeneity that is typically\u0000present in spatial data. To address this issue, we introduce an effective\u0000method specifically designed to discover the potential dependence structures in\u0000multivariate spatial data. Our approach assumes that spatial locations can be\u0000approximately divided into a finite number of clusters, with locations within\u0000the same cluster sharing similar dependence structures. By leveraging an\u0000iterative algorithm that combines spatial clustering with factor analysis, we\u0000simultaneously detect spatial clusters and estimate a unique factor model for\u0000each cluster. The proposed method is evaluated through comprehensive simulation\u0000studies, demonstrating its flexibility. In addition, we apply the proposed\u0000method to a dataset of railway station attributes in the Tokyo metropolitan\u0000area, highlighting its practical applicability and effectiveness in uncovering\u0000complex spatial dependencies.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panel data arise when transitions between different states are interval-censored in multi-state data. The analysis of such data using non-parametric multi-state models was not possible until recently, but is very desirable as it allows for more flexibility than its parametric counterparts. The single result available to date has some unique drawbacks. We propose a non-parametric estimator of the transition intensities for panel data using an Expectation Maximisation algorithm. The method allows for a mix of interval-censored and right-censored (exactly observed) transitions. A condition to check for the convergence of the algorithm to the non-parametric maximum likelihood estimator is given. A simulation study comparing the proposed estimator to an existing consistent estimator shows that the two yield near-identical estimates, with the proposed method incurring a smaller computational cost. A data set on the emergence of teeth in children is analysed. Code to perform the analyses is publicly available.
{"title":"Non-parametric estimation of transition intensities in interval censored Markov multi-state models without loops","authors":"Daniel Gomon, Hein Putter","doi":"arxiv-2409.07176","DOIUrl":"https://doi.org/arxiv-2409.07176","url":null,"abstract":"Panel data arises when transitions between different states are\u0000interval-censored in multi-state data. The analysis of such data using\u0000non-parametric multi-state models was not possible until recently, but is very\u0000desirable as it allows for more flexibility than its parametric counterparts.\u0000The single available result to date has some unique drawbacks. We propose a\u0000non-parametric estimator of the transition intensities for panel data using an\u0000Expectation Maximisation algorithm. The method allows for a mix of\u0000interval-censored and right-censored (exactly observed) transitions. A\u0000condition to check for the convergence of the algorithm to the non-parametric\u0000maximum likelihood estimator is given. A simulation study comparing the\u0000proposed estimator to a consistent estimator is performed, and shown to yield\u0000near identical estimates at smaller computational cost. A data set on the\u0000emergence of teeth in children is analysed. Code to perform the analyses is\u0000publicly available.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Serious crime modelling typically needs to be undertaken securely behind a firewall where police knowledge and capabilities can remain undisclosed. Data informing an ongoing incident is often sparse, with a large proportion of relevant data only coming to light after the incident culminates or after police intervene - by which point it is too late to make use of the data to aid real-time decision making for the incident in question. Much of the data that is available to police to support real-time decision making is highly confidential so cannot be shared with academics, and is therefore missing to them. In this paper, we describe the development of a formal protocol where a graphical model is used as a framework for securely translating a model designed by an academic team to a model for use by a police team. We then show, for the first time, how libraries of these models can be built and used for real-time decision support to circumvent the challenges of data missingness and tardiness seen in such a secure environment. The parallel development described by this protocol ensures that any sensitive information collected by police, and missing to academics, remains secured behind a firewall. The protocol nevertheless guides police so that they are able to combine the typically incomplete data streams that are open source with their more sensitive information in a formal and justifiable way. We illustrate the application of this protocol by describing how a new entry - a suspected vehicle attack - can be embedded into such a police library of criminal plots.
{"title":"Dynamic Bayesian Networks, Elicitation and Data Embedding for Secure Environments","authors":"Kieran Drury, Jim Q. Smith","doi":"arxiv-2409.07389","DOIUrl":"https://doi.org/arxiv-2409.07389","url":null,"abstract":"Serious crime modelling typically needs to be undertaken securely behind a\u0000firewall where police knowledge and capabilities can remain undisclosed. Data\u0000informing an ongoing incident is often sparse, with a large proportion of\u0000relevant data only coming to light after the incident culminates or after\u0000police intervene - by which point it is too late to make use of the data to aid\u0000real-time decision making for the incident in question. Much of the data that\u0000is available to police to support real-time decision making is highly\u0000confidential so cannot be shared with academics, and is therefore missing to\u0000them. In this paper, we describe the development of a formal protocol where a\u0000graphical model is used as a framework for securely translating a model\u0000designed by an academic team to a model for use by a police team. We then show,\u0000for the first time, how libraries of these models can be built and used for\u0000real-time decision support to circumvent the challenges of data missingness and\u0000tardiness seen in such a secure environment. The parallel development described\u0000by this protocol ensures that any sensitive information collected by police,\u0000and missing to academics, remains secured behind a firewall. The protocol\u0000nevertheless guides police so that they are able to combine the typically\u0000incomplete data streams that are open source with their more sensitive\u0000information in a formal and justifiable way. 
We illustrate the application of\u0000this protocol by describing how a new entry - a suspected vehicle attack - can\u0000be embedded into such a police library of criminal plots.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instrumental variables have become a popular study design for the estimation of treatment effects in the presence of unobserved confounders. In the canonical instrumental variables design, the instrument is a binary variable, and most extant methods are tailored to this context. In many settings, however, the instrument is a continuous measure. Standard estimation methods can be applied with continuous instruments, but they require strong assumptions regarding functional form. Moreover, while some recent work has introduced more flexible approaches for continuous instruments, these methods require an assumption known as positivity that is unlikely to hold in many applications. We derive a novel family of causal estimands using a stochastic dynamic intervention framework that considers a range of intervention distributions that are absolutely continuous with respect to the observed distribution of the instrument. These estimands focus on a specific form of local effect but do not require a positivity assumption. Next, we develop doubly robust estimators for these estimands that allow for estimation of the nuisance functions via nonparametric estimators. We use empirical process theory and sample splitting to derive asymptotic properties of the proposed estimators under weak conditions. In addition, we derive methods for profiling the principal strata as well as a method for sensitivity analysis for assessing robustness to an underlying monotonicity assumption. We evaluate our methods via simulation and demonstrate their feasibility using an application on the effectiveness of surgery for specific emergency conditions.
{"title":"Local Effects of Continuous Instruments without Positivity","authors":"Prabrisha Rakshit, Alexander Levis, Luke Keele","doi":"arxiv-2409.07350","DOIUrl":"https://doi.org/arxiv-2409.07350","url":null,"abstract":"Instrumental variables have become a popular study design for the estimation\u0000of treatment effects in the presence of unobserved confounders. In the\u0000canonical instrumental variables design, the instrument is a binary variable,\u0000and most extant methods are tailored to this context. In many settings,\u0000however, the instrument is a continuous measure. Standard estimation methods\u0000can be applied with continuous instruments, but they require strong assumptions\u0000regarding functional form. Moreover, while some recent work has introduced more\u0000flexible approaches for continuous instruments, these methods require an\u0000assumption known as positivity that is unlikely to hold in many applications.\u0000We derive a novel family of causal estimands using a stochastic dynamic\u0000intervention framework that considers a range of intervention distributions\u0000that are absolutely continuous with respect to the observed distribution of the\u0000instrument. These estimands focus on a specific form of local effect but do not\u0000require a positivity assumption. Next, we develop doubly robust estimators for\u0000these estimands that allow for estimation of the nuisance functions via\u0000nonparametric estimators. We use empirical process theory and sample splitting\u0000to derive asymptotic properties of the proposed estimators under weak\u0000conditions. In addition, we derive methods for profiling the principal strata\u0000as well as a method for sensitivity analysis for assessing robustness to an\u0000underlying monotonicity assumption. 
We evaluate our methods via simulation and\u0000demonstrate their feasibility using an application on the effectiveness of\u0000surgery for specific emergency conditions.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
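The "canonical instrumental variables design" with a binary instrument that this abstract contrasts itself against is usually summarised by the Wald estimator. A minimal sketch of that classical baseline (not the paper's continuous-instrument estimator) with toy perfect-compliance data:

```python
def wald_estimator(z, d, y):
    """Canonical binary-instrument IV (Wald) estimator:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]),
    where Z is the instrument, D the treatment, Y the outcome."""
    def mean(vals):
        return sum(vals) / len(vals)
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    d1 = mean([di for zi, di in zip(z, d) if zi == 1])
    d0 = mean([di for zi, di in zip(z, d) if zi == 0])
    return (y1 - y0) / (d1 - d0)

# Toy data with perfect compliance (D = Z): effect of D on Y is 2.
late = wald_estimator([0, 0, 1, 1], [0, 0, 1, 1], [1.0, 1.0, 3.0, 3.0])
```

With a continuous instrument there is no single pair of arms to contrast, which is where functional-form assumptions or the positivity-free estimands of the paper enter.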
This paper proposes a novel method for determining the number of factors in linear factor models under stability considerations. An instability measure is proposed based on the principal angle between the estimated loading spaces obtained by data splitting. Based on this measure, criteria for determining the number of factors are proposed and shown to be consistent. This consistency is obtained using results from random matrix theory, especially the complete delocalization of non-outlier eigenvectors. The advantage of the proposed methods over the existing ones is shown via weaker asymptotic requirements for consistency, simulation studies and a real data example.
{"title":"Determining number of factors under stability considerations","authors":"Sze Ming Lee, Yunxiao Chen","doi":"arxiv-2409.07617","DOIUrl":"https://doi.org/arxiv-2409.07617","url":null,"abstract":"This paper proposes a novel method for determining the number of factors in\u0000linear factor models under stability considerations. An instability measure is\u0000proposed based on the principal angle between the estimated loading spaces\u0000obtained by data splitting. Based on this measure, criteria for determining the\u0000number of factors are proposed and shown to be consistent. This consistency is\u0000obtained using results from random matrix theory, especially the complete\u0000delocalization of non-outlier eigenvectors. The advantage of the proposed\u0000methods over the existing ones is shown via weaker asymptotic requirements for\u0000consistency, simulation studies and a real data example.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}