Pub Date : 2025-12-03 DOI: 10.1016/j.jmva.2025.105565
Yu Han , Peng Luo , Wei Zhang , Xiang Gu
The direct effect of a treatment variable on an endpoint variable, and its indirect effect through a mediator variable, are important for understanding a causal mechanism. The controlled direct effect has a prescriptive interpretation, while the natural direct and indirect effects have a descriptive interpretation. In practice, these three effects are usually very difficult to identify. To tackle this problem, some researchers have investigated upper and lower bounds on these three effects under reasonable identification conditions. For example, Luo and Geng (2016) gave upper and lower bounds on these direct and indirect effects when there is an unobserved mediator-endpoint confounder vector and the endpoint variable is continuous. In this paper, we tighten the bounds on the controlled direct effect in Luo and Geng (2016) when part of the confounders can be observed. Additionally, we give a sufficient condition for identifying the direct and indirect effects when the variables satisfy a linear relationship.
Title: Bounds and identification on direct and indirect effects under partially observed mediator-endpoint confounders. Journal of Multivariate Analysis, vol. 213, Article 105565.
Pub Date : 2025-12-02 DOI: 10.1016/j.jmva.2025.105559
Lei Ge , Rong Liu , Tao Hu , Jianguo Sun
Panel count data are a general type of data arising from studies of recurrent events. They occur when the observed information on each study subject consists only of the numbers of occurrences of the recurrent events between successive examinations. Such data can occur in many fields, including economic studies, medical studies and social sciences. This paper considers regression analysis of multivariate panel count data, with a focus on variable selection and estimation of significant covariate effects. For this problem, a minimum information criterion-based method is proposed, and an expectation–maximization algorithm is developed to compute the proposed estimator. Furthermore, the resulting estimator is shown to have the desirable oracle property, and a simulation study confirms the good finite-sample properties of the proposed method. Finally, the method is applied to a set of real data arising from a skin cancer study.
Title: Simultaneous variable selection and estimation of multivariate panel count data. Journal of Multivariate Analysis, vol. 213, Article 105559.
Pub Date : 2025-12-01 DOI: 10.1016/j.jmva.2025.105567
Jiaqi Hu, Tingyin Wang, Xueqin Wang
The large-dimensional factor model, which aims to reduce dimensionality and extract features through a few latent common factors, has attracted significant interest due to its broad applications. Despite their popularity, traditional methods for factor models may yield incorrect estimators for heavy-tailed data. To address this issue, we introduce the exponential squared loss to the factor model and call the resulting method Robust Exponential Factor Analysis (REFA). We propose a modified rank minimization technique to improve the accuracy of estimating the number of factors in finite samples. Consistency of the estimated factors and loadings is established under mild conditions, without any moment assumptions on the errors. The finite-sample performance of REFA in both light-tailed and heavy-tailed cases is demonstrated through simulation studies. Furthermore, an analysis of a financial dataset of asset returns underscores the superiority of REFA. To facilitate the implementation of our proposed methodology by researchers, we have developed an R package named REFA, which is available on CRAN.
Title: Robust factor analysis with exponential squared loss. Journal of Multivariate Analysis, vol. 213, Article 105567.
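To make the robustness idea in the REFA abstract concrete, the following is a minimal sketch of the exponential squared loss applied to a one-dimensional location problem rather than to a factor model. The loss form 1 - exp(-r^2 / gamma), the tuning parameter `gamma`, and the function names `exp_squared_loss` and `robust_location` are assumptions made for illustration; this is not the REFA package's API and not the authors' estimation algorithm.

```python
import math
from statistics import median


def exp_squared_loss(residual, gamma=1.0):
    """Exponential squared loss 1 - exp(-r^2 / gamma).

    Unlike the squared loss r^2, it never exceeds 1, so a single huge
    residual cannot dominate the objective.
    """
    return 1.0 - math.exp(-(residual * residual) / gamma)


def robust_location(values, gamma=1.0, tol=1e-10, max_iter=100):
    """Minimize sum_i exp_squared_loss(x_i - mu) by a fixed-point iteration.

    Setting the derivative to zero gives mu = sum(w_i * x_i) / sum(w_i) with
    w_i = exp(-(x_i - mu)^2 / gamma), so far-away points get weight near 0.
    The iteration starts from the sample median (an illustrative choice).
    """
    mu = median(values)
    for _ in range(max_iter):
        weights = [math.exp(-((v - mu) ** 2) / gamma) for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# One gross outlier (50.0) among points clustered around 1.0.
data = [0.9, 1.0, 1.1, 1.0, 50.0]
sample_mean = sum(data) / len(data)   # pulled far away from 1.0 by the outlier
robust_mu = robust_location(data)     # stays near the cluster around 1.0
```

The sample mean is dragged toward the outlier, while the estimate based on the bounded loss effectively ignores it. A factor model would apply the same loss to the residuals of a low-rank fit instead of a single location parameter.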
Pub Date : 2025-12-01 DOI: 10.1016/j.jmva.2025.105576
Feng Yang , Zheng Zhou , Yongdao Zhou
Factors that exist only at certain levels of other factors are called nested factors, and the factors that give rise to such nested factors are called branching factors. Experiments with branching and nested factors occur frequently in practical applications. Designing such experiments is challenging due to the special relationship between the branching and nested factors. In this paper, we propose uniform designs for experiments involving branching and nested factors. A novel criterion is introduced to measure the uniformity of such designs, and the corresponding lower bound is also given. Construction methods for uniform designs with branching and nested factors are provided, and their effectiveness is verified by simulation comparisons and a practical manufacturing experiment. The proposed method allows each of the branching, nested and shared factors to be either qualitative or quantitative. Moreover, the run size and the levels of the quantitative factors are very flexible, so that our method works well for both physical and computer experiments.
Title: Uniform designs for experiments with branching and nested factors. Journal of Multivariate Analysis, vol. 212, Article 105576.
Pub Date : 2025-11-29 DOI: 10.1016/j.jmva.2025.105570
Weixiong Liang , Yuehan Yang
A prominent problem in multi-response models is the presence of complex group structures in high-dimensional data, such as overlapping group structures. In such models, both responses and predictors are grouped, and each response group is allowed to relate to multiple predictor groups. Ignoring such structures often yields inadequate statistical inference and misleading conclusions. Motivated by practical needs and the sequential canonical correlation search (SCCS) algorithm proposed by Luo and Chen (2020), this paper proposes two computationally attractive feature selection algorithms, reallocating-SCCS (RSCCS) and prescreening-SCCS (PSCCS), for high-dimensional multi-response models with complex group structures.
The proposed methods, RSCCS and PSCCS, consist of three steps. In the first step, to fully incorporate the group structure information into the feature selection algorithm, both RSCCS and PSCCS select a non-zero coefficient block according to the canonical correlation between the residual response groups and the feature groups. In the second step, RSCCS selects the non-zero coefficient row, while PSCCS conducts screening within the non-zero coefficient block using penalized regularization. In the third step, RSCCS and PSCCS select features by EBIC based on different situations and different iterations. We demonstrate the advantages of these two methods compared with several existing approaches. The statistical guarantees of RSCCS and PSCCS are established. We provide numerical simulation results and analyze a real data example to compare their performance with other methods.
Title: Iterative sequential screening strategies for sparse recovery with computational advantages. Journal of Multivariate Analysis, vol. 212, Article 105570.
Pub Date : 2025-11-29 DOI: 10.1016/j.jmva.2025.105558
Lu Yan , Jiang Hu
Advances in science and technology have led to the prevalence of voluminous data sets distributed across multiple machines. Conventional statistical methodologies may be infeasible for analyzing such massive data sets due to prohibitively long computing times, memory constraints, communication overheads, and confidentiality considerations. In this paper, we propose distributed estimators of the spiked eigenvalues in spiked population models. The consistency and asymptotic normality of the distributed estimators are derived, and a statistical error analysis of the distributed estimators is also provided. Compared to estimation from the full sample, the proposed distributed estimation shares the same order of convergence. A simulation study and a real data analysis indicate that the proposed distributed estimation and testing procedures have excellent properties in terms of estimation accuracy and stability as well as transmission efficiency.
Title: Distributed estimation of spiked eigenvalues in spiked population models. Journal of Multivariate Analysis, vol. 212, Article 105558.
Pub Date : 2025-11-28 DOI: 10.1016/j.jmva.2025.105564
Wenzhi Yang , Chi Yao , Yiming Liu , Guangming Pan , Wang Zhou
In this paper, we consider mean tests for high-dimensional data and give two new tests, each of which consists of three steps. First, we reduce the high-dimensional vectors to many low-dimensional vectors and construct Hotelling's $T^2$ tests. Second, using the distribution or asymptotic distribution of these Hotelling's $T^2$ tests under the null hypothesis, we transform them into random variables that are uniformly or asymptotically uniformly distributed. Third, central limit theorems for the normalized sum of these transformations are obtained in the Gaussian and non-Gaussian cases. Moreover, the asymptotic power of the new tests is also presented for the non-Gaussian case. Compared to existing tests, our tests have both good empirical sizes and high empirical power.
Title: The mean tests with high dimensional data. Journal of Multivariate Analysis, vol. 212, Article 105564.
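The three steps described in the abstract above can be written out schematically as follows. This is only a sketch under simplifying assumptions that are not taken from the paper: Gaussian data, $m$ blocks of equal dimension $k$ with $n > k$ observations, and independence across blocks. The symbols $m$, $k$, $n$, $\bar{X}_j$, $S_j$, $\mu_{0j}$, $U_j$ and $Z_m$ are introduced here for illustration, and the normalization used by the authors may differ, in particular when the blocks are dependent.

```latex
% Step 1: Hotelling's T^2 statistic on the j-th low-dimensional block,
% with sample mean \bar{X}_j, sample covariance S_j and null mean \mu_{0j}.
T_j^2 = n\,(\bar{X}_j - \mu_{0j})^{\top} S_j^{-1} (\bar{X}_j - \mu_{0j}),
\qquad j = 1, \dots, m.

% Step 2: under H_0 with Gaussian data, \frac{n-k}{k(n-1)} T_j^2 follows an
% F distribution with (k, n-k) degrees of freedom, so the probability
% integral transform yields a uniform random variable.
U_j = F_{k,\,n-k}\!\left( \frac{n-k}{k(n-1)}\, T_j^2 \right)
\sim \mathrm{Uniform}(0,1).

% Step 3: a Uniform(0,1) variable has mean 1/2 and variance 1/12, so for
% independent blocks the normalized sum is asymptotically standard normal.
Z_m = \sqrt{\frac{12}{m}} \sum_{j=1}^{m} \left( U_j - \tfrac{1}{2} \right)
\xrightarrow{\;d\;} N(0,1) \quad \text{as } m \to \infty.
```

In this simplified form, a mean shift in any block makes the corresponding $T_j^2$ stochastically larger, which pushes $U_j$ toward 1 and $Z_m$ toward large positive values.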
Pub Date : 2025-11-28 DOI: 10.1016/j.jmva.2025.105550
Liqi Xia , Ruiyuan Cao , Jiang Du , Ling Liu
This paper proposes a novel approach to enhance the versatility of the max-sum test in high-dimensional data analysis by combining two distinct rank correlation coefficients: Spearman's $\rho$ and Chatterjee's $\xi$. We uncovered the independence between the max-type test and the sum-type test by deriving their joint distribution. This insight enables the development of a comprehensive max-sum test that tackles both sparse and dense alternative correlation structures in an adaptive manner. Leveraging the asymptotic independence between the two coefficients and the intrinsic highlights of two single-coefficient tests, we have strategically implemented Cauchy combination principles to devise a multifunctional testing methodology. This approach can accommodate monotonic and nonmonotonic data types and thus offers a versatile solution to a broad spectrum of analytical requirements. This versatility of our proposed method has been impressively demonstrated through a diverse range of simulation data studies and two real-world data analyses, underscoring its effectiveness and practical utility.
Title: Rank-based combination independence tests for high-dimensional data. Journal of Multivariate Analysis, vol. 212, Article 105550.
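Two ingredients named in the abstract above, Chatterjee's rank correlation and the Cauchy combination of p-values, can be sketched in a few lines. The code below is an illustration only: it uses the standard no-ties formula for Chatterjee's coefficient and the standard Cauchy combination rule with equal weights by default, and it does not reproduce the paper's max-type or sum-type statistics over many variable pairs. The function names `chatterjee_xi` and `cauchy_combine` are hypothetical.

```python
import math


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation for samples without ties.

    Sort the pairs by x, take the ranks r_1, ..., r_n of the reordered y
    values, and return 1 - 3 * sum(|r_{i+1} - r_i|) / (n^2 - 1).
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    y_by_x = [y[i] for i in order]
    # Rank of each y value among all y values (1 = smallest); assumes no ties.
    rank_of = {value: rank for rank, value in enumerate(sorted(y_by_x), start=1)}
    ranks = [rank_of[value] for value in y_by_x]
    total_jump = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total_jump / (n * n - 1)


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to tan((0.5 - p) * pi), which is standard Cauchy
    under the null; the weighted sum is converted back to a p-value with the
    Cauchy tail probability. Weights are nonnegative and sum to 1.
    """
    m = len(p_values)
    if weights is None:
        weights = [1.0 / m] * m
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic) / math.pi


# A perfectly monotone relationship on five points.
xi_monotone = chatterjee_xi([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# One small and one large p-value: the combined value is driven by the small one.
p_combined = cauchy_combine([0.01, 0.9])
```

For a monotone relationship without ties the coefficient equals 1 - 3/(n + 1), which approaches 1 as n grows. The Cauchy rule is attractive for combining a monotone-sensitive test with a nonmonotone-sensitive one because a single small p-value dominates the combined statistic.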
Pub Date : 2025-11-28 DOI: 10.1016/j.jmva.2025.105552
Shuangshuang Li , Jianbao Chen
Panel data collected from “locations” may exhibit spatial and serial correlations. To study such spatial and serial correlations, together with possible nonlinear relationships, we introduce a fixed effects partially linear nonparametric panel regression model with a separable spatially and serially correlated error structure. We obtain profile quasi-maximum likelihood estimators (PQMLEs) of the unknowns. Furthermore, a generalized F-test called $F_{NT}$ is designed for assessing whether the nonparametric component is appropriately specified. Asymptotic properties of the PQMLEs and $F_{NT}$ are provided under several conditions. Monte Carlo experiments indicate that our estimators and test statistic perform well in finite samples and that model misspecification may substantially affect the estimates of the unknown parameters. The analysis of provincial housing prices in China reveals the presence of nonlinear, spatial and serial correlation relationships.
Title: Estimation and testing for fixed effects partially linear nonparametric panel regression model with separable spatially and serially correlated error structure. Journal of Multivariate Analysis, vol. 212, Article 105552.
Pub Date : 2025-11-28 DOI: 10.1016/j.jmva.2025.105571
Fang Xie , Lihu Xu , Qiuran Yao , Huiming Zhang
This paper investigates the distribution estimation of contaminated data using the MoM-GAN method, which leverages the power of generative adversarial nets (GANs) and median-of-means (MoM) estimation. Specifically, we use a deep neural network (DNN) with a ReLU activation function to model the generator and discriminator of the GAN. In terms of theoretical analysis, we derive a non-asymptotic error bound for the DNN-based MoM-GAN estimator, which is measured by integral probability metrics and takes into account the $b$-smoothness Hölder class. The error bound essentially decreases in $n^{-b/p} \vee n^{-1/2}$, where $n$ and $p$ are the sample size and the dimension of the input data, respectively. It provides a rigorous guarantee of the accuracy and robustness of the MoM-GAN estimator, even in the presence of contaminated data. We present an algorithm for the MoM-GAN method and demonstrate its effectiveness in two real-world applications. Our results show that the MoM-GAN method outperforms other competing methods when dealing with contaminated data, highlighting its superior performance and robustness.
Title: Statistical guarantees for distribution estimation of contaminated data via DNN-based MoM-GANs. Journal of Multivariate Analysis, vol. 212, Article 105571.
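The median-of-means principle mentioned in the abstract above can be illustrated on the simplest possible problem, estimating a mean from contaminated data. This sketch is not the MoM-GAN estimator, which applies the same idea to a GAN objective; the function name `median_of_means`, the contiguous block split, and the toy data are assumptions chosen for illustration.

```python
from statistics import mean, median


def median_of_means(values, num_blocks):
    """Median-of-means estimate of the mean.

    Split the sample into num_blocks contiguous blocks of equal size
    (leftover observations at the end are dropped), average each block,
    and return the median of the block averages. A few contaminated
    observations can spoil only a few blocks, so the median is unaffected
    as long as most blocks are clean.
    """
    if not 1 <= num_blocks <= len(values):
        raise ValueError("num_blocks must be between 1 and len(values)")
    block_size = len(values) // num_blocks
    block_means = [
        mean(values[b * block_size:(b + 1) * block_size])
        for b in range(num_blocks)
    ]
    return median(block_means)


# Eight clean observations plus one gross outlier in the last block.
sample = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 1000.0]
plain_mean = mean(sample)               # dominated by the outlier
mom_mean = median_of_means(sample, 3)   # only one of the three block means is spoiled
```

In practice the data are usually shuffled before blocking, and the number of blocks trades off robustness (more blocks tolerate more outliers) against the variance of each block mean.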