Pub Date: 2024-02-29; DOI: 10.1007/s00362-024-01530-8
Abstract
In this paper, we propose a new scale-invariant test for linear hypotheses on mean vectors under heteroscedasticity in high-dimensional settings. Most existing tests impose strong conditions on the covariance matrices so that their null distributions are asymptotically normal, which restricts the applicability of those procedures. In contrast, the null distribution of our proposed test may take different forms under mild conditions. Additionally, the well-known Welch-Satterthwaite chi-square approximation that we adopt can automatically mimic the shape of the null distribution of the test statistic. The finite-sample performance of the test is illustrated by simulations and a real-data example, which show that it is robust and more powerful than three competitors.
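The abstract does not give the test statistic itself, so the following is only a generic sketch of the Welch-Satterthwaite idea it mentions: a weighted sum of independent chi-square(1) variables is approximated by a scaled chi-square, beta * chi2_d, by matching the first two moments. The function name `welch_satterthwaite` and the example weights are illustrative assumptions, not taken from the paper.

```python
def welch_satterthwaite(weights):
    """Two-moment match of T = sum_i w_i * chi2_1 by beta * chi2_d.

    E[T] = sum(w) and Var[T] = 2 * sum(w^2); equating these with the
    mean beta * d and variance 2 * beta^2 * d of beta * chi2_d gives
    beta = sum(w^2) / sum(w) and d = sum(w)^2 / sum(w^2).
    """
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s1 <= 0 or s2 <= 0:
        raise ValueError("weights must have a positive sum")
    beta = s2 / s1          # scale of the approximating chi-square
    d = s1 * s1 / s2        # approximate (possibly fractional) degrees of freedom
    return beta, d


# Hypothetical weights, e.g. eigenvalue-like quantities of a covariance matrix.
beta, d = welch_satterthwaite([2.0, 1.0, 1.0])
print(beta, d)
```

When all weights are equal the approximation is exact (d equals the number of terms); as one weight dominates, d shrinks toward 1 and the approximating distribution becomes more skewed, which is the sense in which the approximation adapts to the shape of the null distribution.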
{"title":"A scale-invariant test for linear hypothesis of means in high dimensions","doi":"10.1007/s00362-024-01530-8","DOIUrl":"https://doi.org/10.1007/s00362-024-01530-8","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"46 22 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"140001940","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-02-26; DOI: 10.1007/s00362-024-01529-1
Bruno Ebner, Norbert Henze, Simos Meintanis
We propose a general and relatively simple method to construct goodness-of-fit tests on the sphere and the hypersphere. The method is based on the characterization of probability distributions via their characteristic function, and it leads to test criteria that are convenient regarding applications and consistent against arbitrary deviations from the model under test. We emphasize goodness-of-fit tests for spherical distributions due to their importance in applications and the relative scarcity of available methods.
{"title":"A unified approach to goodness-of-fit testing for spherical and hyperspherical data","authors":"Bruno Ebner, Norbert Henze, Simos Meintanis","doi":"10.1007/s00362-024-01529-1","DOIUrl":"https://doi.org/10.1007/s00362-024-01529-1","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"2 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139969034","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-02-20; DOI: 10.1007/s00362-024-01528-2
Rauf Ahmad, Per Johansson, Mårten Schultzberg
Increasing computational power has led to growing interest in Fisher’s test in social science. As Fisher and Neyman inference are based on different principles, there is also growing interest in understanding the differences between the two procedures. For example, Young (2018) found that the Fisher test has better size properties than the Neyman test in the presence of influential observations. Ding (2017), on the other hand, showed that the asymptotic variance of the mean-difference estimator (MDE) under Fisher inference is larger than that under Neyman inference, and that the asymptotic Fisher test is less powerful than the t-test even in the simplest case of a homogeneous effect. Since the MDE plays an important role in policy evaluation, these latter results raise concerns about using Fisher’s test as argued for in Young (2018). With the aim of providing an understanding of the usefulness of the exact Fisher test for inference to the sample and to the population, this paper clarifies the results in Ding (2017). Using a novel Monte Carlo simulation following the same data generating processes as in Ding (2017), we demonstrate that the Fisher test has no worse power properties than the t-test even with heterogeneous effects.
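For readers unfamiliar with the procedure being compared, here is a minimal sketch of a Monte Carlo Fisher randomization test based on the mean-difference estimator. It is not the simulation design of the paper or of Ding (2017); the function names, the number of draws, and the toy data are assumptions made for illustration.

```python
import random


def mean_diff(y, z):
    """Mean-difference estimator: treated mean minus control mean."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def fisher_randomization_pvalue(y, z, draws=2000, seed=0):
    """Two-sided Monte Carlo p-value under the sharp null of no effect.

    Under the sharp null the outcomes y are fixed, so re-randomizing the
    assignment vector z generates the reference distribution of the MDE.
    """
    rng = random.Random(seed)
    observed = abs(mean_diff(y, z))
    z_perm = list(z)
    hits = 0
    for _ in range(draws):
        rng.shuffle(z_perm)  # keeps the number of treated units fixed
        if abs(mean_diff(y, z_perm)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the Monte Carlo p-value is never exactly zero.
    return (hits + 1) / (draws + 1)


# Hypothetical outcomes with a large treatment effect.
y = [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1]
z = [1, 1, 1, 1, 0, 0, 0, 0]
print(fisher_randomization_pvalue(y, z))
```

Shuffling the assignment vector rather than drawing independent coin flips mirrors a completely randomized design with a fixed number of treated units; a different design would call for a different re-randomization step.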
{"title":"Is Fisher inference inferior to Neyman inference for policy analysis?","authors":"Rauf Ahmad, Per Johansson, Mårten Schultzberg","doi":"10.1007/s00362-024-01528-2","DOIUrl":"https://doi.org/10.1007/s00362-024-01528-2","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"70 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139921756","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-02-14; DOI: 10.1007/s00362-023-01523-z
Karim Benhenni, Ali Hajj Hassan, Yingcai Su
This article considers the problem of nonparametric estimation of the regression function \(r\) in a functional regression model \(Y = r(X) + \varepsilon\) with a scalar response Y, a functional explanatory variable X, and a second-order stationary error process \(\varepsilon\). Under some specific criteria, we construct a local linear kernel estimator of \(r\) from functional random design with correlated errors. The exact rates of convergence of the mean squared error of the constructed estimator are established for both short- and long-range dependent error processes. Simulation studies are conducted on the performance of the proposed simple local linear estimator. Examples of time series data are considered.
{"title":"The effect of correlated errors on the performance of local linear estimation of regression function based on random functional design","authors":"Karim Benhenni, Ali Hajj Hassan, Yingcai Su","doi":"10.1007/s00362-023-01523-z","DOIUrl":"https://doi.org/10.1007/s00362-023-01523-z","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"208 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139762372","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-17; DOI: 10.1007/s00362-023-01525-x
Jinyu Zhou, Jigao Yan, Dongya Cheng
In this paper, strong consistency of the tail value-at-risk (TVaR) estimator under widely orthant dependent (WOD) samples is established, and a numerical simulation is performed to verify the validity of the theoretical results. To reveal the essence of the result, we investigate complete convergence and complete moment convergence corresponding to the Baum–Katz law, as well as the Marcinkiewicz–Zygmund type strong law of large numbers (MZSLLN), for maximal weighted sums and maximal product sums of WOD random variables. The results obtained extend the corresponding ones for independent and some dependent random variables.
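The abstract does not state which TVaR estimator is studied, so the sketch below uses one common empirical version as an assumption: VaR at level p is the order statistic of rank ceil(n * p), and TVaR is the average of the observations strictly above it. The name `empirical_var_tvar` and the toy losses are hypothetical.

```python
import math


def empirical_var_tvar(losses, p):
    """Empirical VaR and TVaR at level p (0 < p < 1) from a loss sample.

    VaR is the ceil(n * p)-th smallest loss; TVaR is the mean of the losses
    strictly exceeding VaR, falling back to VaR when there are none.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    x = sorted(losses)
    n = len(x)
    k = math.ceil(n * p)            # rank of the empirical p-quantile
    var = x[k - 1]
    tail = [v for v in x if v > var]
    tvar = sum(tail) / len(tail) if tail else var
    return var, tvar


# Hypothetical losses 1, 2, ..., 10 at level p = 0.5.
print(empirical_var_tvar(range(1, 11), 0.5))
```

Other conventions (for example averaging the tail including the VaR observation, or interpolating the quantile) give slightly different finite-sample values, so this should be read as one possible choice rather than the estimator analyzed in the paper.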
{"title":"Strong consistency of tail value-at-risk estimator and corresponding general results under widely orthant dependent samples","authors":"Jinyu Zhou, Jigao Yan, Dongya Cheng","doi":"10.1007/s00362-023-01525-x","DOIUrl":"https://doi.org/10.1007/s00362-023-01525-x","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"1 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139501214","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-13; DOI: 10.1007/s00362-023-01526-w
Weirong Li, Wensheng Zhu
The growing prevalence of data heterogeneity motivates the identification of homogeneous subgroups with identical parameters. Meanwhile, in many recent data science applications, such as personalized education and personalized marketing, massive data are usually recorded as categorical or ordinal variables, which highlights the importance of performing subgroup analysis on ordinal outcomes. In this paper, we propose a cumulative link model with subject-specific intercepts to detect and identify homogeneous subgroups through a concave pairwise fusion penalty for ordinal responses, where heterogeneity arises from some unknown or unobserved latent factors. The concave fusion method can simultaneously determine the number of subgroups, identify the group membership, and estimate the regression coefficients. An alternating direction method of multipliers algorithm with concave penalties for the generalized linear regression model with logit link is developed and its convergence property is studied. We also establish the oracle property of the proposed penalized estimator under some mild conditions. Our simulation studies show that the proposed method can recover the heterogeneous subgroup structure effectively when the response of interest is ordinal. Further, the advantages of our method are illustrated by the analysis of a Mathematics Student Performance Data Set of two public schools from the Alentejo region of Portugal.
{"title":"Subgroup analysis with concave pairwise fusion penalty for ordinal response","authors":"Weirong Li, Wensheng Zhu","doi":"10.1007/s00362-023-01526-w","DOIUrl":"https://doi.org/10.1007/s00362-023-01526-w","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"46 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139460088","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-12; DOI: 10.1007/s00362-023-01527-9
Jürgen Groß, Annette Möller
The size of the effect of the difference in two groups with respect to a variable of interest may be estimated by the classical Cohen’s d. A recently proposed generalized estimator allows conditioning on further independent variables within the framework of a linear regression model. In this note, it is demonstrated how unbiased estimation of the effect size parameter together with a corresponding standard error may be obtained based on the non-central t distribution. The portrayed estimator may be considered as a natural generalization of the unbiased Hedges’ g. In addition, confidence interval estimation for the unknown parameter is demonstrated by applying the so-called inversion confidence interval principle. The regarded properties collapse to already known ones in case of absence of any additional independent variables. The stated remarks are illustrated with a publicly available data set.
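As a reference point for the covariate-free case that the generalized estimator reduces to, the following sketch computes the classical Cohen's d and the unbiased Hedges' g using the exact gamma-function correction factor. The generalization to additional regressors described in the abstract is not implemented here; function names and the toy data are illustrative assumptions.

```python
import math


def cohens_d(a, b):
    """Classical Cohen's d: mean difference divided by the pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_sd = math.sqrt((ssa + ssb) / (na + nb - 2))
    return (ma - mb) / pooled_sd


def hedges_correction(nu):
    """Exact factor J(nu) = Gamma(nu/2) / (sqrt(nu/2) * Gamma((nu-1)/2)), nu > 1."""
    log_ratio = math.lgamma(nu / 2) - math.lgamma((nu - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(nu / 2)


def hedges_g(a, b):
    """Unbiased effect size estimate g = J(nu) * d with nu = na + nb - 2."""
    nu = len(a) + len(b) - 2
    return hedges_correction(nu) * cohens_d(a, b)


# Hypothetical two-group data.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
print(cohens_d(group_a, group_b), hedges_g(group_a, group_b))
```

The factor J(nu) arises because d, suitably scaled, follows a non-central t distribution with nu degrees of freedom; it is below 1 and approaches 1 as nu grows, so the correction matters mainly for small samples.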
{"title":"Some additional remarks on statistical properties of Cohen’s d in the presence of covariates","authors":"Jürgen Groß, Annette Möller","doi":"10.1007/s00362-023-01527-9","DOIUrl":"https://doi.org/10.1007/s00362-023-01527-9","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"17 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139460089","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-09; DOI: 10.1007/s00362-023-01524-y
Frédéric Ouimet
The multivariate inverse hypergeometric (MIH) distribution is an extension of the negative multinomial (NM) model that accounts for sampling without replacement in a finite population. Even though most studies on longitudinal count data with a specific number of ‘failures’ occur in a finite setting, the NM model is typically chosen over the more accurate MIH model. This raises the question: How much information is lost when inferring with the approximate NM model instead of the true MIH model? The loss is quantified by a measure called deficiency in statistics. In this paper, asymptotic bounds for the deficiencies between MIH and NM experiments are derived, as well as between MIH and the corresponding multivariate normal experiments with the same mean-covariance structure. The findings are supported by a local approximation for the log-ratio of the MIH and NM probability mass functions, and by Hellinger distance bounds.
{"title":"Deficiency bounds for the multivariate inverse hypergeometric distribution","authors":"Frédéric Ouimet","doi":"10.1007/s00362-023-01524-y","DOIUrl":"https://doi.org/10.1007/s00362-023-01524-y","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"40 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139409125","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-04; DOI: 10.1007/s00362-023-01520-2
Abstract
Instead of applying the commonly used parametric Almon or Beta lag distribution of MIDAS, Breitung and Roling (J Forecast 34:588–603, 2015) suggested a nonparametric smoothed least-squares shrinkage estimator (henceforth \({SLS}_{1}\)) for estimating mixed-frequency models. This \({SLS}_{1}\) approach ensures a flexible smooth trending lag distribution. However, even if the biasing parameter in \({SLS}_{1}\) solves the overparameterization problem, the cost is a decreased goodness-of-fit. Therefore, we suggest a modification of this shrinkage regression into a two-parameter smoothed least-squares estimator (\({SLS}_{2}\)). This estimator solves the overparameterization problem, and it has superior properties since it ensures that the orthogonality assumption between residuals and the predicted dependent variable holds, which leads to an increased goodness-of-fit. Our theoretical comparisons, supported by simulations, demonstrate that the increase in goodness-of-fit of the proposed two-parameter estimator also leads to a decrease in the mean square error of \({SLS}_{2}\) compared to that of \({SLS}_{1}\). Empirical results, where the inflation rate is forecasted based on the oil returns, demonstrate that our proposed \({SLS}_{2}\) estimator for mixed-frequency models provides better estimates in terms of decreased MSE and improved \(R^{2}\), which in turn leads to better forecasts.
{"title":"Improved Breitung and Roling estimator for mixed-frequency models with application to forecasting inflation rates","doi":"10.1007/s00362-023-01520-2","DOIUrl":"https://doi.org/10.1007/s00362-023-01520-2","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"15 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139104723","RegionNum":3,"RegionCategory":"Mathematics"}
Pub Date: 2024-01-02; DOI: 10.1007/s00362-023-01521-1
Yan-ni Jhan, Wan-cen Li, Shin-hui Ruan, Jia-jyun Sie, Iebin Lian
Despite criticism for loss of information and power, dichotomization of variables is still frequently used in social, behavioral, and medical sciences, mainly because it yields more interpretable conclusions for research outcomes and is useful for decision making. However, the artificial choice of cut-points can be controversial and needs proper justification. In this work, we investigate the properties of point-biserial correlation after dichotomization with underlying bimodal Gaussian mixture distributions. We propose a dichotomous grouping procedure that considers the largest standardized difference in group mean while minimizing information loss.
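To make the quantity under study concrete, the sketch below computes the point-biserial correlation between a continuous variable and its dichotomized version at a given cut-point, and picks the candidate cut with the largest correlation. Maximizing this correlation is used here only as a stand-in criterion; the abstract does not spell out the proposed grouping procedure, and the names `point_biserial` and `best_cut` as well as the toy data are assumptions.

```python
import math


def point_biserial(x, cut):
    """Point-biserial correlation between x and the indicator 1{x > cut}.

    Uses r = (M1 - M0) / s_n * sqrt(p * q), where s_n is the standard
    deviation with divisor n and p, q are the two group proportions.
    """
    n = len(x)
    high = [v for v in x if v > cut]
    low = [v for v in x if v <= cut]
    if not high or not low:
        return 0.0  # the dichotomized variable is constant
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    p = len(high) / n
    diff = sum(high) / len(high) - sum(low) / len(low)
    return diff / sd * math.sqrt(p * (1 - p))


def best_cut(x, candidates):
    """Candidate cut-point that maximizes the point-biserial correlation."""
    return max(candidates, key=lambda c: point_biserial(x, c))


# Hypothetical bimodal-looking sample with two well-separated clusters.
x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
print(point_biserial(x, 5.0), best_cut(x, [1.5, 5.0, 8.5]))
```

A higher point-biserial correlation means the binary variable retains more of the linear information in the original variable, which is one way to quantify the information loss that the abstract refers to.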
{"title":"Optimal dichotomization of bimodal Gaussian mixtures","authors":"Yan-ni Jhan, Wan-cen Li, Shin-hui Ruan, Jia-jyun Sie, Iebin Lian","doi":"10.1007/s00362-023-01521-1","DOIUrl":"https://doi.org/10.1007/s00362-023-01521-1","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"21 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"139078896","RegionNum":3,"RegionCategory":"Mathematics"}