Data privacy is a growing concern in modern data analysis as more and more types of information about individuals are collected and shared. Statistical analysis that takes privacy into account is thus becoming an exciting area of research. Differential privacy provides a means of measuring the stochastic risk of violating the privacy of individuals that can result from conducting an analysis, such as a simple database query or a hypothesis test. The main interest of the work is a goodness-of-fit test that compares the sampled data to a known distribution. Many differentially private goodness-of-fit tests have been proposed for discrete random variables, but little work has been done for continuous variables. The objective is to review some existing tests that guarantee differential privacy for discrete random variables, and to propose an extension to continuous cases via a discretization process. The proposed test procedures are demonstrated through simulated examples and applied to the Household Financial Welfare Survey of South Korea in 2018.
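The discretization idea in the abstract above can be illustrated with a minimal sketch. It is not the test procedure proposed in the work. It assumes equal-probability bins defined through the null distribution function, the standard Laplace mechanism applied to the bin counts, and a plain Pearson chi-square statistic. The function names (`private_bin_counts`, `chi_square_stat`) and the settings (10 bins, privacy budget 1.0) are hypothetical. A valid test would also have to account for the added noise when calibrating the reference distribution, which is omitted here.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_bin_counts(sample, null_cdf, n_bins, epsilon, rng):
    # Discretize: map each observation through the null CDF and assign it
    # to one of n_bins cells that are equally likely under the null.
    counts = [0] * n_bins
    for x in sample:
        k = min(int(null_cdf(x) * n_bins), n_bins - 1)
        counts[k] += 1
    # Assumption: neighbouring data sets differ by replacing one record,
    # which changes two counts by one each (L1 sensitivity 2).
    scale = 2.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def chi_square_stat(noisy_counts, n):
    # Pearson statistic against equal expected counts n / n_bins.
    expected = n / len(noisy_counts)
    return sum((c - expected) ** 2 / expected for c in noisy_counts)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
noisy = private_bin_counts(sample, std_normal_cdf, n_bins=10, epsilon=1.0, rng=rng)
stat = chi_square_stat(noisy, len(sample))
```

Choosing bins that are equally likely under the null keeps the expected counts equal, so a single noise scale affects every cell in the same way.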
Binary classification is an important issue in many applications, but it has mostly been studied for independent data in the literature. Binary time series classification is investigated by proposing a semiparametric procedure named “Model Averaging nonlinear MArginal LOgistic Regressions” (MAMaLoR) for binary time series data based on the time series information of predictor variables. The procedure involves approximating the logistic multivariate conditional regression function by combining low-dimensional nonparametric nonlinear marginal logistic regressions, in the sense of Kullback-Leibler distance. A time series conditional likelihood method is suggested for estimating the optimal averaging weights together with local maximum likelihood estimation of the nonparametric marginal time series logistic (auto)regressions. The asymptotic properties of the procedure are established under mild conditions on the time series observations that are of -mixing property. The procedure is computationally less demanding and can avoid the “curse of dimensionality”; it can be easily applied to nonlinear time series classification forecasting based on high-dimensional lagged information. The performance of the procedure is further confirmed by both Monte Carlo simulation and an empirical study of forecasting the direction of market movement for the FTSE 100 index.
The identification and estimation of conditional quantile functions for count responses using longitudinal data are considered. The approach is based on a continuous approximation to distribution functions for count responses within a class of parametric models that are commonly employed. It is first shown that conditional quantile functions for count responses are identified in zero-inflated models with subject heterogeneity. Then, a simple three-step approach is developed to estimate the effects of covariates on the quantiles of the response variable. A simulation study is presented to show the small sample performance of the estimator. Finally, the advantages of the proposed estimator in relation to some existing methods are illustrated by estimating a model of annual visits to physicians using data from a health insurance experiment.
In an extensive pseudo out-of-sample horserace, classical estimators (dynamic factor models, RIDGE and partial least squares regression) and Regularized Sliced Inverse Regression, which is new to forecasting, exhibit nearly equivalent forecasting accuracy in a large panel of macroeconomic variables across targets, horizons and subsamples. This finding motivates the theoretical contributions in this paper. Most widely used linear dimension reduction methods are shown to solve closely related maximization problems with solutions that can be decomposed into signal and scaling components. They are organized under a common scheme that sheds light on their commonalities and differences as well as on their functionality. Regularized Sliced Inverse Regression delivers the most parsimonious forecast model and obtains the greatest reduction of the complexity of the forecasting problem. Nevertheless, the study’s findings are that (a) the intrinsic relationship between the forecast target and the other macroseries in the panel is linear and (b) targeting contributes to reducing the complexity of modeling yet does not induce significant gains in macroeconomic forecasting accuracy.
Autoregressive models are reviewed for the analysis of multivariate count time series. A particular topic of interest, discussed in detail, is the choice of a suitable distribution for a vector of count random variables. The focus is on three main approaches taken for multivariate count time series analysis: (a) integer autoregressive processes, (b) parameter-driven models and (c) observation-driven models. The aim is to highlight some recent methodological developments and propose some potentially useful research topics.
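To make approach (a) concrete, the following minimal sketch simulates a univariate first-order integer autoregressive process, INAR(1), with binomial thinning and Poisson innovations. It is only a building block. The multivariate constructions covered by the review are not reproduced, and the function names and parameter values (`alpha = 0.5`, `lam = 2.0`) are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    # Knuth's multiplication method; adequate for small lam.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def binomial_thinning(x, alpha, rng):
    # Each of the x current counts survives independently with probability alpha.
    return sum(1 for _ in range(x) if rng.random() < alpha)


def simulate_inar1(n, alpha, lam, rng, x0=0):
    # X_t = (alpha thinned X_{t-1}) + eps_t, with eps_t ~ Poisson(lam).
    series, x = [], x0
    for _ in range(n):
        x = binomial_thinning(x, alpha, rng) + poisson_draw(lam, rng)
        series.append(x)
    return series


rng = random.Random(42)
alpha, lam = 0.5, 2.0
series = simulate_inar1(5000, alpha, lam, rng)
sample_mean = sum(series) / len(series)
stationary_mean = lam / (1.0 - alpha)  # mean of the stationary distribution
```

The thinning operator keeps the process integer-valued, which an ordinary autoregression with a real-valued coefficient would not, and the sample mean of a long path should be close to `lam / (1 - alpha)`.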
The sum of a random number of independent and identically distributed random vectors has a distribution which is not analytically tractable in the general case. The problem has been addressed by means of asymptotic approximations embedding the number of summands in a stochastically increasing sequence. Another approach relies on fitting flexible and tractable parametric multivariate distributions, such as finite mixtures. Both approaches are investigated within the framework of Edgeworth expansions. A general formula for the fourth-order cumulants of the random sum of independent and identically distributed random vectors is derived, and it is shown that the above-mentioned asymptotic approach does not necessarily lead to valid asymptotic normal approximations. The problem is addressed by means of Edgeworth expansions. Both theoretical and empirical results suggest that mixtures of two multivariate normal distributions with proportional covariance matrices satisfactorily fit data generated from random sums where the counting random variable and the random summands are Poisson and multivariate skew-normal, respectively.
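As a small illustration of a random sum, the sketch below simulates a univariate sum with a Poisson number of terms and compares its sample mean and variance with the standard identities E[S] = E[N]E[X] and Var(S) = E[N]Var(X) + Var(N)E[X]^2. It is a simplification made for illustration: the summands are univariate normal rather than multivariate skew-normal, the fourth-order cumulant formula of the abstract is not reproduced, and all names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    # Knuth's multiplication method; adequate for small lam.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def random_sum(lam, mu, sigma, rng):
    # S = X_1 + ... + X_N with N ~ Poisson(lam) and X_i ~ Normal(mu, sigma^2).
    # Assumption: normal summands stand in for the skew-normal case.
    n_terms = poisson_draw(lam, rng)
    return sum(rng.gauss(mu, sigma) for _ in range(n_terms))


rng = random.Random(7)
lam, mu, sigma = 5.0, 1.0, 2.0
draws = [random_sum(lam, mu, sigma, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((s - sample_mean) ** 2 for s in draws) / (len(draws) - 1)

# For Poisson N, E[N] = Var(N) = lam, so Var(S) = lam * (sigma^2 + mu^2).
theory_mean = lam * mu
theory_var = lam * (sigma ** 2 + mu ** 2)
```

With these values the theoretical mean is 5 and the theoretical variance is 25. The random number of terms contributes the extra `lam * mu ** 2` to the variance, which is one reason the distribution of a random sum differs from that of a sum with a fixed number of terms.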