Pub Date: 2024-03-01 | DOI: 10.1007/s00180-024-01475-4
Mathias von Ottenbreit, Riccardo De Bin
Regression modelling often presents a trade-off between predictiveness and interpretability. Popular, highly predictive tree-based algorithms such as Random Forest and boosted trees predict the outcome of new observations very well, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms such as linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. In addition, as a boosting algorithm it automatically handles variable selection, and as a MARS-based approach it takes non-linear relationships and possible interaction terms into account. We show on simulated and real data examples that APLR's predictive performance is comparable to that of the top-performing approaches, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a Scikit-learn compatible estimator.
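The combination of boosting with MARS-style terms that the abstract describes can be illustrated with a deliberately simplified sketch: greedy least-squares boosting over hinge functions max(0, x - t) and max(0, t - x) on a coarse knot grid. This is an illustration of the general idea only; APLR's actual term search, regularization, and handling of interactions differ.

```python
import numpy as np

def fit_hinge_boosting(x, y, n_terms=10, lr=0.5):
    # Greedy least-squares boosting over MARS-style hinge functions:
    # each iteration scans a coarse knot grid, fits both hinge
    # orientations to the current residual, and adds the shrunken
    # best-fitting term.  Simplified sketch, not APLR itself.
    resid = y - y.mean()
    terms = [("intercept", y.mean())]
    knots = np.quantile(x, np.linspace(0.05, 0.95, 19))
    for _ in range(n_terms):
        best = None
        for t in knots:
            for sign, h in (("+", np.maximum(0.0, x - t)),
                            ("-", np.maximum(0.0, t - x))):
                denom = float(h @ h)
                if denom == 0.0:
                    continue
                beta = float(h @ resid) / denom       # best LS slope
                sse = float((resid - beta * h) @ (resid - beta * h))
                if best is None or sse < best[0]:
                    best = (sse, sign, t, beta, h)
        _, sign, t, beta, h = best
        resid = resid - lr * beta * h                  # shrunken update
        terms.append((sign, t, lr * beta))
    return terms, resid
```

On data that really is piecewise linear in one predictor, a few boosted hinge terms already remove most of the residual variation, which is the intuition behind combining the two model families.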
Title: Automatic piecewise linear regression
Pub Date: 2024-02-24 | DOI: 10.1007/s00180-024-01470-9
Larissa C. Alves, Ronaldo Dias, Helio S. Migon
This work presents a new scalable automatic Bayesian Lasso methodology with variational inference for non-parametric spline regression that can capture the non-linear relationship between a response variable and predictor variables. From the non-parametric point of view, the regression curve is assumed to lie in an infinite-dimensional space. Regression splines use a finite approximation of this infinite space, representing the regression function by a linear combination of basis functions. The crucial point of the approach is determining the appropriate number of basis functions or, equivalently, the number of knots, avoiding over-fitting and under-fitting. A decision-theoretic approach was devised for knot selection. Comprehensive simulation studies were conducted in challenging scenarios to compare alternative criteria for knot selection, thereby ensuring the efficacy of the proposed algorithms. Additionally, the performance of the proposed method was assessed using real-world datasets. The novel procedure demonstrated good performance in capturing the underlying data structure by selecting the appropriate number of knots/basis functions.
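The finite basis-function approximation mentioned above can be made concrete with a minimal regression-spline fit in the truncated-power basis with a fixed number of equally spaced interior knots. This plain least-squares sketch is a stand-in for the paper's method, which instead places a Bayesian Lasso prior on the coefficients and selects the number of knots by a decision-theoretic criterion.

```python
import numpy as np

def fit_spline(x, y, n_knots, degree=3):
    # Least-squares regression spline in the truncated-power basis
    # 1, x, ..., x^degree, (x - k)_+^degree, with n_knots equally
    # spaced interior knots.  Assumed basis and knot placement; the
    # paper's variational Bayesian Lasso approach differs.
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.maximum(0.0, x - k) ** degree for k in knots]
    B = np.column_stack(cols)                      # design matrix
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # LS coefficients
    return knots, coef, B @ coef
```

Increasing `n_knots` enriches the approximating space but risks over-fitting, which is exactly the trade-off the knot-selection criterion is designed to resolve.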
Title: Variational Bayesian Lasso for spline regression
In this article, we propose a Poisson-Lindley distribution as a stochastic abundance model in which the sample is drawn according to an independent Poisson process. Jeffreys' and Bernardo's reference priors have been obtained, and Bayes estimators of the number of species are proposed for this model. The proposed Bayes estimators have been compared with the corresponding profile and conditional maximum likelihood estimators in terms of the square root of their risks under the squared error loss function (SELF). Jeffreys' and Bernardo's reference priors have been considered and compared within the Bayesian approach based on biological data.
Title: Bayesian estimation of the number of species from Poisson-Lindley stochastic abundance model using non-informative priors
Authors: Anurag Pathak, Manoj Kumar, Sanjay Kumar Singh, Umesh Singh, Sandeep Kumar
Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01464-7
Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01468-3
Takayuki Umeda
Normally distributed random numbers are commonly used in scientific computing in various fields. It is important to generate a set of random numbers as close to a normal distribution as possible in order to reduce initial fluctuations. Two types of samples from a uniform distribution are examined as source samples for inverse transform sampling methods. Three types of inverse transform sampling methods with new approximations of inverse cumulative distribution functions are also discussed for converting uniformly distributed source samples into normally distributed samples.
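As a point of reference, inverse transform sampling of the normal distribution maps a uniform source sample through an approximation of the inverse cumulative distribution function. The sketch below uses a stratified midpoint source sample and the rational inverse-CDF approximation in Python's standard library; both are illustrative choices, not the specific source samples or new approximations proposed in the paper.

```python
from statistics import NormalDist

def inverse_transform_normal(n):
    # Inverse transform sampling of the standard normal from a
    # stratified uniform source sample: the midpoints (i + 0.5)/n of
    # n equal subintervals of (0, 1).  NormalDist.inv_cdf evaluates
    # Phi^{-1} via a rational approximation.
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf((i + 0.5) / n) for i in range(n)]
```

Because the source sample is stratified and symmetric about 0.5, the resulting sample is symmetric about zero with essentially no initial fluctuation in its mean, which is the motivation for using structured rather than purely random uniform sources.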
Title: Generation of normal distributions revisited
Pub Date: 2024-02-21 | DOI: 10.1007/s00180-024-01466-5
Luca Pedini
This article presents the gretl package BayTool, which complements the software's native functionality, mostly concerned with frequentist approaches, with Bayesian estimation methods for commonly used econometric models. Computational efficiency is achieved by pairing an extensive use of Gibbs sampling for posterior simulation with the possibility of splitting single-threaded experiments across multiple cores or machines by means of parallelization. From the user's perspective, the package requires only basic knowledge of gretl scripting to access its full functionality, while providing a point-and-click solution in the form of a graphical interface for a less experienced audience. These features make BayTool stand out as an excellent teaching device without sacrificing more advanced or complex applications.
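To make the Gibbs-sampling workflow concrete, here is a generic textbook sampler for the conjugate normal linear regression model, written in Python rather than gretl scripting. The priors and hyperparameters are assumptions for the sketch; this is not BayTool's implementation.

```python
import numpy as np

def gibbs_linear_regression(X, y, n_iter=2000, seed=0):
    # Gibbs sampler for the normal linear model
    #   y = X beta + eps,  eps ~ N(0, sigma^2 I),
    # with assumed priors beta ~ N(0, 100 I), sigma^2 ~ InvGamma(a0, b0).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    a0, b0 = 2.0, 1.0                      # assumed hyperparameters
    prior_prec = np.eye(p) / 100.0
    sigma2 = 1.0
    draws = np.empty((n_iter, p))
    XtX, Xty = X.T @ X, X.T @ y
    for t in range(n_iter):
        # beta | sigma2, y  ~  N(m, V)
        V = np.linalg.inv(XtX / sigma2 + prior_prec)
        m = V @ (Xty / sigma2)
        beta = rng.multivariate_normal(m, V)
        # sigma2 | beta, y  ~  InvGamma(a0 + n/2, b0 + RSS/2)
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + resid @ resid / 2))
        draws[t] = beta
    return draws
```

Since each full conditional is sampled exactly, independent chains like this one are embarrassingly parallel, which is what makes the core-splitting strategy described above effective.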
Title: Bayesian regression models in gretl: the BayTool package
Pub Date: 2024-02-20 | DOI: 10.1007/s00180-024-01458-5
Erina Paul, Santosh Sutradhar, Jonathan Hartzel, Devan V. Mehrotra
Designing vaccine efficacy (VE) trials often requires recruiting large numbers of participants when the diseases of interest have a low incidence. When developing novel vaccines, such as for COVID-19, the plausible range of VE is quite large at the design stage. Thus, the number of events needed to demonstrate efficacy above a pre-defined regulatory threshold can be difficult to predict, and the time needed to accrue the necessary events can often be long. Therefore, it is advantageous to evaluate efficacy at earlier interim analyses, potentially allowing the trial to stop early for overwhelming VE or for futility. In such cases, incorporating interim analyses through the sequential probability ratio test (SPRT) allows for multiple analyses while controlling both type-I and type-II errors. In this article, we propose a Bayesian SPRT for designing a vaccine trial comparing a test vaccine with a control, assuming two Poisson incidence rates. We provide guidance on how to choose the prior distribution and how to optimize the number of events for interim analyses to maximize the efficiency of the design. Through simulations, we demonstrate that the proposed Bayesian SPRT performs better than the corresponding frequentist SPRT. An R repository implementing the proposed method is available at: https://github.com/Merck/bayesiansprt.
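For orientation, the classical frequentist SPRT that the Bayesian version is compared against can be sketched as follows. With 1:1 randomization and equal follow-up, conditional on the total case count the number of vaccine-arm cases is binomial with success probability p = (1 - VE)/(2 - VE), so the sequential test reduces to a binomial likelihood-ratio walk between Wald boundaries. The VE hypotheses and error rates below are illustrative assumptions, not the paper's design settings.

```python
import math

def sprt_decision(cases_vaccine, cases_total, ve0=0.3, ve1=0.6,
                  alpha=0.025, beta=0.1):
    # Wald SPRT sketch for a VE trial: conditional binomial likelihood
    # ratio with boundaries log((1-beta)/alpha) and log(beta/(1-alpha)).
    p0 = (1 - ve0) / (2 - ve0)   # case probability under H0: VE = ve0
    p1 = (1 - ve1) / (2 - ve1)   # case probability under H1: VE = ve1
    x, n = cases_vaccine, cases_total
    llr = (x * math.log(p1 / p0)
           + (n - x) * math.log((1 - p1) / (1 - p0)))
    if llr >= math.log((1 - beta) / alpha):
        return "efficacy"          # cross upper boundary: accept H1
    if llr <= math.log(beta / (1 - alpha)):
        return "futility"          # cross lower boundary: accept H0
    return "continue"              # accrue more events
```

The test is applied repeatedly as events accrue; crossing either boundary stops the trial, which is what enables early stopping for overwhelming VE or futility.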
Title: Bayesian sequential probability ratio test for vaccine efficacy trials
Pub Date: 2024-02-19 | DOI: 10.1007/s00180-024-01457-6
Claudio Conversano, Luca Frigau, Giulia Contu
Network-based Semi-Supervised Clustering (NeSSC) is a semi-supervised approach for clustering in the presence of an outcome variable. It uses a classification or regression model on resampled versions of the original data to produce a proximity matrix that indicates the magnitude of the similarity between pairs of observations, measured with respect to the outcome. This matrix is transformed into a complex network on which a community detection algorithm is applied to search for an underlying community structure, i.e., a partition of the instances into highly homogeneous clusters, which is evaluated in terms of the outcome. In this paper, we focus on the case in which the outcome variable used in NeSSC is numeric and propose an alternative criterion for selecting the optimal partition, based on a measure of overlapping between density curves, as well as a penalization criterion that accounts for the number of clusters in a candidate partition. Next, we assess the performance of the proposed method on some artificial datasets and on 20 different real datasets, and compare NeSSC with three other popular methods of semi-supervised clustering with a numeric outcome. Results show that NeSSC with the overlapping criterion works particularly well when the partition contains a small number of localized clusters.
Title: Overlapping coefficient in network-based semi-supervised clustering
Pub Date: 2024-02-15 | DOI: 10.1007/s00180-024-01462-9
Xing Liu, Weihua Deng
This paper discusses the first exit and Dirichlet problems of the nonisotropic tempered $\alpha$-stable process $X_t$. The upper bounds of all moments of the first exit position $|X_{\tau_D}|$ and the first exit time $\tau_D$ are explicitly obtained. It is found that the probability density function of $|X_{\tau_D}|$ or $\tau_D$ decays exponentially with the increase of $|X_{\tau_D}|$ or $\tau_D$, and that $\mathrm{E}[\tau_D] \sim \mathrm{E}\left[|X_{\tau_D}-\mathrm{E}[X_{\tau_D}]|^2\right]$ and $\mathrm{E}[\tau_D] \sim |\mathrm{E}[X_{\tau_D}]|$. Next, we obtain the Feynman–Kac representation of the Dirichlet problem by employing semigroup theory. Furthermore, averaging the generated trajectories of the stochastic process leads to the solution of the Dirichlet problem, which is also verified by numerical experiments.
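The trajectory-averaging idea in the last sentence is the probabilistic representation $u(x_0) = \mathrm{E}[g(X_{\tau_D})]$: simulate paths from $x_0$ until they exit the domain $D$ and average the boundary data at the exit positions. The sketch below uses standard Brownian motion as the driving process for simplicity; the paper's nonisotropic tempered $\alpha$-stable increments would replace the Gaussian steps.

```python
import numpy as np

def dirichlet_mc(x0, g, radius=1.0, dt=1e-2, n_paths=2000, seed=0):
    # Monte Carlo Feynman-Kac sketch for the Dirichlet problem on the
    # ball D = {|x| < radius}: simulate Euler steps of a driving
    # process (here: Brownian motion, an assumed simplification) until
    # first exit, then average g at the exit positions.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_paths):
        x = np.array(x0, dtype=float)
        while np.linalg.norm(x) < radius:          # first exit from D
            x += np.sqrt(dt) * rng.standard_normal(x.shape)
        total += g(x)                              # boundary data at exit
    return total / n_paths
```

The discretization overshoots the boundary by roughly $\sqrt{dt}$ per step, so in practice the step size is shrunk (or an exit-correction is applied) to control this bias.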
Title: First exit and Dirichlet problem for the nonisotropic tempered $\alpha$-stable processes
Pub Date: 2024-02-13 | DOI: 10.1007/s00180-024-01463-8
Wolfgang Kössler, Hans-J. Lenz, Xing D. Wang
Benford's law is used world-wide for detecting non-conformance or data fraud in numerical data. It states that the significand of a data set from the universe is not uniformly but logarithmically distributed. In particular, the first non-zero digit is 1 with probability approximately 0.3. Several tests are available for testing Benford conformance; the best known are Pearson's $\chi^2$-test, the Kolmogorov–Smirnov test and a modified version of the MAD-test. In the present paper we propose some tests; three of the four invariant sum tests are new, and they are motivated by the sum-invariance property of Benford's law. Two distance measures are investigated: the Euclidean and the Mahalanobis distance of the standardized sums to the origin. We use the significands corresponding to the first significant digit as well as the second significant digit, respectively. Moreover, we suggest improved versions of the MAD-test and obtain critical values that are independent of the sample sizes. For illustration, the tests are applied to specifically selected data sets for which prior knowledge is available about being or not being Benford. Furthermore, we discuss the role of truncation of distributions.
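The baseline against which the new tests are compared is easy to state: under Benford's law the first significant digit $d$ occurs with probability $\log_{10}(1 + 1/d)$, giving $\approx 0.301$ for $d = 1$. A minimal sketch of Pearson's $\chi^2$-test on first digits, using scientific-notation formatting to extract the leading digit of the significand:

```python
import math
from collections import Counter

def benford_chi2(data):
    # Pearson chi-square statistic for conformance of first significant
    # digits with Benford's law, P(d) = log10(1 + 1/d), d = 1..9.
    # The leading digit is read off the scientific-notation significand.
    digits = [int(f"{abs(x):e}"[0]) for x in data if x != 0]
    n = len(digits)
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2
```

Under conformance the statistic is approximately $\chi^2$-distributed with 8 degrees of freedom (5% critical value 15.51); geometric growth such as powers of 2 conforms closely, while uniformly distributed first digits are rejected decisively.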
Title: Some new invariant sum tests and MAD tests for the assessment of Benford's law
Pub Date: 2024-02-12 | DOI: 10.1007/s00180-024-01465-6
Yi Wu, Wei Wang, Xuejun Wang
Let $\{X_i, 1\le i\le n\}$ be a sequence generated by a linear process based on dependent random variables with random coefficients, which has a mean shift at an unknown location. The cumulative sum (CUSUM, for short) estimator of the change point is studied. The strong convergence, $L_r$ convergence, complete convergence and the rate of strong convergence are established for the CUSUM estimator under some mild conditions. These results improve and extend corresponding ones in the literature. Simulation studies and two real data examples are also provided to support the theoretical results.
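The CUSUM estimator itself is simple to compute: the change point is estimated as the index $k$ maximizing the absolute centered partial-sum statistic $|S_k - (k/n)S_n|$, where $S_k = \sum_{i\le k} X_i$. A minimal sketch (the paper's contribution is the asymptotic theory for dependent innovations with random coefficients, not the estimator itself):

```python
import numpy as np

def cusum_change_point(x):
    # CUSUM estimator of a single mean-shift location: maximize the
    # absolute centered partial sum |S_k - (k/n) S_n| over k = 1..n-1.
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(1, n)                  # candidate change points
    partial = np.cumsum(x)[:-1]          # S_1, ..., S_{n-1}
    stat = np.abs(partial - k * x.sum() / n)
    return int(np.argmax(stat)) + 1      # observations before the shift
```

For a clean level shift the statistic is tent-shaped with its peak exactly at the true change point, which is why the estimator is consistent under the mild dependence conditions studied in the paper.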
Title: Convergence of the CUSUM estimation for a mean shift in linear processes with random coefficients